I INTRODUCTION


The contents of this report represent the culmination of a program
to develop the technology and the data analysis methods for identifying
oils by the use of field ionization mass spectrometry. This program was
supported by contracts DOT-CG-22996-A and DOT-Cp-81-74-1187 with the
U. S. Coast Guard during the period June 26, 1972 to May 17, 1976. Our efforts
involved primarily the use of two spectrometer systems, which were
assembled, for the most part, from commercially available components and were
capable of producing spectra of crude and refined oils spanning a mass range
of 400 amu with the ability to reproduce spectra to better than ± 10%.

The problems of analyzing spectral data are many and are not limited
to field ionization mass spectrometry. Among the more difficult problems
for this program were the detection and deconvolution of fused peaks and
problems associated with the variability in peak resolution of our
spectrometers at ion masses greater than 300 amu. Perhaps the most difficult
problem was the unavoidable existence of small, but significant, systematic
errors which, for practical reasons, prevented us from making full use of
all the information inherent in a spectrogram. A complete solution to the
fused peak and variable resolution problems came recently as a result of
the design of a third spectrometer, comprising a 60° sector magnet mass
analyzer and a new field ionization source (funded by another project).

Resolution problems associated with the previous two systems have been avoided
by the use of a spectral representation that is independent of
resolution. The problems presented by systematic errors have been avoided
by abandoning our previous statistical analysis model in favor of a modified
"learning machine" discriminant function. The application of this
discriminant function to the "resolution-independent" renditions of 154
quadrupole spectra comprising the representations of 35 different oils
produced correct and unambiguous identifications for 95% of the spectra.

The  discriminant  function  model  can  also  be  used  for  ranking  spectra  in  the 
order  of  their  similarity.  This  technique  was  applied  to  the  older  quadrupole 
data  and  to  a set  of  42  spectra  produced  by  the  new  sector  magnet  spectrometer. 
These  results  are  voluminous  and  are  presented  in  separate  binders. 

The "Review of Data Analysis Techniques" describes several statistical
models that proved to be unsuccessful or impractical for the analysis of
mass spectra. The "Statistical Model" described in Section III is based
on the assumption of stochastic independence, a condition that has not
been satisfied by our data. This model did, however, serve as a useful
background for the development of the "Empirical Model".

As this project matured, it became apparent that the cost of acquiring
sufficient data to serve as training sets in a "learning machine" approach
or for use in a statistical model would be prohibitive for routine applications.
The empirical model, our final approach to the analysis of mass spectrometric
oil "fingerprints", may provide the Coast Guard with a simple and
objective measure of the similarity between two spectra that could be
used within the constraints imposed by systematic error, variable
resolution and economic considerations.
II HISTORICAL PERSPECTIVE


Review  of  Hardware  Development 

The first spectrometer that was designed and constructed for this project
included the multipoint field ionization source with wire mesh grid, shown
schematically in Figure 1, and an E×B field velocity filter, the Colutron® [2].
The system is shown schematically in Figure 2 and was described at the Sixth
International Mass Spectrometry Conference [3]. The mass range was scanned by
varying the electric field of the velocity filter linearly with time by means
of a ramp generator and high voltage operational amplifiers. The individual
spectra that resulted from repetitively scanning the mass range as the
specimen evaporated into the ionizer were integrated into a single spectrum
in a 4096 channel multiscaler [4]. In an effort to optimize the stability of
the data acquisition system, the multiscaler was triggered by a voltage
marker in the velocity filter circuit. In this manner the first channel
always received counts corresponding to ions of the same mass/charge ratio.
Figure 3 shows examples of the spectra that were produced with this system.
Note the nonlinearity of the mass scale. This is due to the proportionality
between m/e and $1/E^2$.

FIGURE 3 MASS "FINGERPRINT" COMPARISONS OF FUEL OIL SAMPLES

FIGURE 4 SCHEMATIC OF FIELD IONIZATION FINGERPRINT APPARATUS

A modified version of the above system is shown schematically in
Figure 4. In this case the ion beam was focused onto a slit at the exit
of the extractor lens which acted as the object whose image was focused
by the einzel lens onto the plane of the electron multiplier ion detector.
The ramp voltage supplied by the multiscaler as a "scope sweep" was used
in place of the external function generator voltage of the first system.
The linear ramp voltage was used as the input to an inverse square root
circuit whose output was a voltage that varied inversely with the square
root of time. In this manner, the proportionality of m/e with $1/E^2$ was
transformed into a linear function of time. Figures 5 and 6 are examples of
the spectra produced by this system. The improvement in resolution is due to
the modification in ion optics and to the fact that the velocity filter
E-field sweep and the multiscaler memory address sweep were now controlled by
the same clock. Note also the linear mass scale. The modified E×B field
spectrometer was described in the International Journal of Mass Spectrometry
and Ion Physics [5].
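The effect of the inverse square root circuit on the mass scale can be illustrated with a short sketch. It assumes only the stated proportionality of m/e to $1/E^2$; the constants `k`, `e0` and `c` and the function names are hypothetical and are not taken from the instrument.

```python
def mass_to_charge(e_field, k=1.0):
    """Velocity filter relation from the text: m/e is proportional to 1/E^2."""
    return k / e_field ** 2


def linear_ramp_field(t, e0=1.0):
    # First system: the electric field is swept linearly with time.
    return e0 * t


def inverse_sqrt_field(t, c=1.0):
    # Modified system: the field varies inversely with the square root of time.
    return c / t ** 0.5


times = [1.0, 2.0, 3.0, 4.0]
nonlinear_scale = [mass_to_charge(linear_ramp_field(t)) for t in times]
linear_scale = [mass_to_charge(inverse_sqrt_field(t)) for t in times]
```

With the linear ramp, equal time steps give unequal mass steps, whereas with the inverse square root sweep the mass scale advances in proportion to time.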

Notwithstanding the improved performance of the velocity filter
spectrometer, its operation was difficult due to the necessity of floating
the ionization source at $10^4$ volts and due to the difficulty in tuning it
for wide mass range operation. We therefore assembled the quadrupole
spectrometer system shown schematically in Figure 7. Figure 8 is a
representative example of the quality of spectra produced by the quadrupole
analyzer.

The superior resolution of the velocity filter is evident in a comparison
of Figures 5 and 6 with Figure 8. However, the quadrupole was more reliable
as indicated by a comparison of the standard deviation data shown in
Tables 1 and 2. Since the significance of physical measurements varies
approximately with the inverse of the standard deviation, the quadrupole
FIGURE 5 WIDE MASS SPECTRA OF STATESBURG, MISSOURI CRUDE OIL

Upper  trace  shows  spectrum  of  molecular  fragment  ions  produced  by 
sustained  electric  discharge  in  source. 


FIGURE  6 TWO  FINGERPRINTS  OF  A SHELL  NO.  6 FUEL  OBTAINED  3 MONTHS  APART 
FIGURE 7 QUADRUPOLE INTEGRATING MULTISCANNING FIELD

FIGURE 8 QUADRUPOLE SPECTRA OF SHELL NO. 6 FUEL OIL
TABLE 1 Reproducibility Data for Quadrupole Mass Analyzer

σ, averaged over all 6 oils = 6.77%

NAME                          NSPEC   NPEAK   σ, %
Shell No. 6                     16      33    6.90
Gulf No. 6, Philadelphia        13      34    7.57
Gulf No. 6, Cincinnati           9      49    9.07
Quirequire Crude, Venezuela     10      35    4.76
Zelten Crude, Libya             12      33    4.50
Bolivar Crude                   10      30    7.85

NSPEC = number of spectra analyzed per oil
NPEAK = number of mass peaks analyzed per spectrum
σ = standard deviation, averaged over (NSPEC) × (NPEAK) peaks per oil
TABLE 2 Reproducibility Data for Colutron Velocity Filter

σ, averaged over all 6 oils = 18.27%

NAME                            NSPEC   σ, %
Shell No. 6                       7     37.2
Gulf No. 6, Santa Fe Springs      6     12.3
Gulf No. 6, Port Arthur           5     11.1
Quirequire Crude, Venezuela       6     17.7
Zelten Crude, Libya               5     13.0
Statesburg Crude, Missouri        5     17.7

NSPEC = sample size = number of spectra
σ = relative standard deviation averaged over peaks at 9 different mass numbers
was chosen as the preferred analyzer. Figures 5, 6 and 8 also demonstrate
a problem that was common to both systems, the degradation of resolution
at mass numbers exceeding approximately 280 amu for the quadrupole and
300 amu for the velocity filter.

Normally, field ionization sources are "activated" by growing carbon
whiskers on the tips of the points (see Figure 1) by using the source to
ionize certain hydrocarbons (e.g., toluene). Deactivation consists of the
removal of these whiskers by oxidizing constituents. The most persistent
problem encountered in this program was an incompatibility between the
multipoint field ionization source and some of the heavier constituents of
oils. Each specimen was, therefore, prepared by a vacuum distillation
technique to remove both the heavier constituents, which tended to deactivate
the source, and the lighter constituents, which were considered to be more
sensitive to sample handling and storage temperatures and, therefore, less
reliable for "fingerprinting" purposes. Recent changes in the
design of our field ionization sources have enabled us to activate
them at 1000°C [7]. The resulting pyrolytic carbon whiskers
are relatively inert with respect to the chemical action of oil constituents
and have made it possible to obtain molecular ion profiles of crude and
refined oils without preparation by vacuum distillation. Figures 9, 10
and 11 show examples of spectra obtained with a 60° sector magnet mass
analyzer equipped with a pyrolytic whisker field ionization source. This
system was used for the analyses of 28 oil specimens supplied to us by the
U. S. Coast Guard.


FIGURE 9 60° SECTOR MAGNET SPECTROGRAM OF IRANIAN CRUDE OIL


FIGURE 10 60° SECTOR MAGNET SPECTRUM OF COAL PRODUCTS


Review  of  Data  Analysis  Techniques 


Our first attempt to define criteria for determining whether or not
two spectra represent the same oil involved the use of parametric
multivariate statistics. A "true" or population mean spectrum
was assumed to exist for every oil. Variations of individual spectra of
the same oil about the true spectrum were assumed to be random and independent.
Independence means that variations in the sizes of peaks at the jth and
kth mass numbers are uncorrelated for $j \neq k$. Each spectrum was normalized
so that the sum of the squares of the peak areas was equal to unity.
The normalized peak areas at each mass number were assumed to be normally
distributed about their respective population mean values. When these
assumptions are valid, it can be shown [8] that

$$P\left[\sum_{j=1}^{J} \frac{(x_j - \mu_j)^2}{\sigma_j^2} > \chi^2_\alpha\right] = \alpha$$

where $x_j$ is the normalized area of the peak at the jth mass number whose
population mean value is $\mu_j$, $\sigma_j^2$ is the population variance for
peak areas at the jth mass number, J is the number of peaks in the spectrum,
and $\chi^2_\alpha$ is the Chi-squared statistic with J - 1 degrees of freedom
evaluated at the $100\alpha$% significance level. The number of degrees of
freedom is J - 1, as the normalization constraint removes one degree of
freedom. According to the above equation, $\alpha$ is the probability that the
sum of the squares of all the spectral differences (normalized to unit
variance) will exceed the threshold $\chi^2_\alpha$ given that the $x_j$'s and
$\mu_j$'s represent the same oil. In other words, $\chi^2_\alpha$ defines
$100(1 - \alpha)$% confidence limits for the measurement of molecular ion
profiles of the oil whose true spectrum is defined by the $\mu_j$,
j = 1, 2, ..., J.
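A minimal sketch of how the Chi-squared statistic above would be evaluated for one spectrum follows. The peak areas, population means and variances are invented for illustration, and the critical value 7.815 is the tabulated Chi-squared value for 3 degrees of freedom at the 5% level; none of these numbers come from the project data.

```python
import math


def normalize_unit_sum_of_squares(peak_areas):
    """Scale peak areas so that the sum of their squares equals unity."""
    norm = math.sqrt(sum(a * a for a in peak_areas))
    return [a / norm for a in peak_areas]


def chi_squared_statistic(x, mu, var):
    """Sum over mass numbers of (x_j - mu_j)^2 / sigma_j^2."""
    return sum((xj - mj) ** 2 / vj for xj, mj, vj in zip(x, mu, var))


# Hypothetical 4-peak spectrum and hypothetical population parameters.
x = normalize_unit_sum_of_squares([2.0, 1.0, 2.0, 4.0])
mu = [0.40, 0.25, 0.40, 0.78]
var = [0.001, 0.001, 0.001, 0.001]

statistic = chi_squared_statistic(x, mu, var)
CRITICAL_VALUE_3_DOF_5_PERCENT = 7.815  # from standard Chi-squared tables
consistent_with_oil = statistic <= CRITICAL_VALUE_3_DOF_5_PERCENT
```

The sketch presumes that the population means and variances are known, which is exactly the requirement that proved impractical to meet.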

This approach failed because it was impractical to obtain enough
spectra for reasonable estimates of the population means and variances
and because it was discovered that the data would not support the assumption
of independence.

Criteria for the identification of oils by mass spectrometric
fingerprinting can be defined in terms of sample means and sample variances of
spectra whose peak area distributions are correlated, by the use of a
multivariate generalization of the Student-t distribution [9]. Unfortunately,
the number of spectra required for this analysis is greater than the number
of peaks in the spectrum. A test for the identity of the spectrum of an
unknown specimen with a reference spectrum consisting of peaks at 30 different
mass numbers would require at least 31 spectra of the unknown and the
reference each. A technique in which this analysis is limited to peaks
at mass numbers where spectral differences appear to be significant is
not valid, as the generalized Student-t model cannot be applied to peaks at
selected mass numbers. Therefore, the generalized Student-t model
was abandoned.

At this point, we had to reassess the problem of spectral data analysis.
If one regards the task of identifying oils as a forensic problem, the
valid use of statistical models is attractive because it provides answers
to the questions: "what are the chances for not identifying the culprit?"
and, presumably of equal or greater importance, "what are the chances for
making a false identification?" The alternative is an empirical approach:
define an identification criterion and test it on a "training set" of spectra
of known identities. The following sections consist of a general overview
of some of the statistical aspects of using spectral information for
identification purposes and a description of the empirical technique that
was finally used for classifying spectra and for measuring the similarity
between two spectra.


III THE IDENTIFICATION OF OILS BY

FIELD IONIZATION MASS SPECTROMETRY


Defining  an  Identification  Criterion 


Spectrograms can be ranked in order of their similarity to a
reference spectrogram by counting the number of positions in the spectral
range at which the relative peak heights or the relative "highs" and "lows"
in the spectral envelope are approximately equal. If $Z_j$ is a measure
of the difference between two spectra at the jth mass number and
t is a fixed threshold value, the number of values of j for which
$|Z_j| \le t$ can be used as a measure of similarity. This is a binary
decision technique; differences either exceed threshold or do not
exceed threshold. For identifying complex mixtures by the use of field
ionization mass spectrometry, the threshold value t should be chosen
so that comparisons between spectra of the same specimen rank high on
the similarity scale, while comparisons between spectra of different
mixtures result in low similarity scores.


A mass spectrum can be represented by a vector X with components
$x_1, x_2, \ldots, x_J$, where $x_j$ is the height or area of the peak at the
jth mass number. The vector representation of a spectrum can be normalized
so that the sum of all the components or all of the squared components
equals unity. The difference between two spectra X and Y can be computed
as the difference between their normalized J-dimensional vector representations.
The jth component, $x_j - y_j$, of this vector difference is one way to
define $Z_j$.
For the purpose of deciding whether or not two spectra represent the
same oil, an identification criterion must be chosen. The latter may be
defined as an arbitrary lower limit for the degree of similarity. To be
concise, $H_0$ is the null hypothesis: spectra X and Y represent the same oil.
Reject $H_0$ when $|Z_j| > t$ for at least R values of j, j = 1, 2, ..., J.
The degree of similarity is given by J - R, where R is called the "Hamming
distance." Clearly, if a small value for t is chosen, $H_0$ will be rejected
in a large number of cases where X and Y represent the same oil. Conversely,
if a large value of t is chosen, $H_0$ will be accepted in a large number of
cases where X and Y represent different oils. A prudent decision on the
value of t can be made on the basis of a statistical model or empirically,
as a result of the analysis of a set of spectra of known identity.
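The binary decision technique described above can be sketched in a few lines. The spectra, the threshold and the function names below are hypothetical; the sketch takes the componentwise difference of the normalized vectors as the spectral difference and counts how many components exceed the threshold.

```python
import math


def normalize(spectrum):
    """Scale a spectrum so that the sum of its squared components is unity."""
    norm = math.sqrt(sum(v * v for v in spectrum))
    return [v / norm for v in spectrum]


def hamming_distance(x, y, t):
    """Number of mass numbers at which |x_j - y_j| exceeds the threshold t."""
    xn, yn = normalize(x), normalize(y)
    return sum(1 for a, b in zip(xn, yn) if abs(a - b) > t)


def degree_of_similarity(x, y, t):
    """J minus the Hamming distance."""
    return len(x) - hamming_distance(x, y, t)


def same_oil(x, y, t, r_reject):
    """Accept the null hypothesis unless r_reject or more differences exceed t."""
    return hamming_distance(x, y, t) < r_reject


# Hypothetical peak heights: two runs of one oil and one run of another.
run_1 = [10.0, 20.0, 30.0, 40.0]
run_2 = [10.0, 21.0, 29.0, 40.0]
other = [40.0, 30.0, 20.0, 10.0]
```

For these invented values and t = 0.05, the two runs of the same oil differ at no position while the other oil differs at every position, which illustrates how the choice of t governs the balance between false rejections and false acceptances.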


Statistical  Model 

Assume $H_0$ is true and suppose that for a prescribed probability $\alpha$
a threshold t can be determined such that

$$P\left[\,|Z_j| > t\,\right] = \alpha, \quad j = 1 \text{ or } 2 \text{ or } \ldots \text{ or } J \qquad (1)$$

Assume that the spectral differences $Z_j$ and $Z_k$ are stochastically
independent for $j \neq k$. The requirement $|Z_j| \le t$ for all J values of j
as the identification criterion implies

$$P\left[\text{accept } H_0 \mid H_0 \text{ is true}\right] = (1 - \alpha)^J \qquad (2)$$

$$P\left[\text{reject } H_0 \mid H_0 \text{ is true}\right] = 1 - (1 - \alpha)^J \qquad (3)$$

Equation (3) describes the probability for incorrectly rejecting $H_0$.
Suppose we are willing to tolerate this error in 100A% of the spectrum
pairs analyzed. Then

$$1 - (1 - \alpha)^J = A \qquad (4)$$

Solving for $\alpha$ obtains

$$\alpha = 1 - (1 - A)^{1/J} \qquad (5)$$

Combining equations (1) and (5) yields

$$P\left[\,|Z_j| > t\,\right] = 1 - (1 - A)^{1/J} \qquad (6)$$

Equation (6) enables us to choose a threshold t so that the risk for error
is equal to A when the number of possible spectral differences is J and when
the identification criterion requires zero spectral differences greater than t.

As an example, consider the case $Z_j = (x_j - \mu_j)/\sigma_j$, where $x_j$ is
distributed normally with mean $\mu_j$ and variance $\sigma_j^2$. In other
words, we are comparing the vector representation of a single spectrum of a
particular oil to the vector representation of the "true" spectrum of the same
oil. For spectra consisting of, say, J = 20 peaks and an acceptable risk of
A = 0.05 the value of t for which

$$P\left[\,|Z_j| > t\,\right] = 1 - (0.95)^{0.05} \approx 0.0025 \qquad (7)$$

is 3.01, as found in statistics tables.
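The worked example for J = 20 and A = 0.05 can be checked numerically. The sketch below uses the standard normal distribution from the Python standard library in place of statistics tables; the function names are illustrative only.

```python
from statistics import NormalDist


def per_peak_alpha(risk, n_peaks):
    """Equation (5): alpha = 1 - (1 - A)^(1/J)."""
    return 1.0 - (1.0 - risk) ** (1.0 / n_peaks)


def two_sided_threshold(alpha):
    """Threshold t with P(|Z| > t) = alpha for a unit-variance normal Z."""
    return NormalDist().inv_cdf(1.0 - alpha / 2.0)


alpha = per_peak_alpha(risk=0.05, n_peaks=20)
t = two_sided_threshold(alpha)
```

The computed probability and threshold come out close to the rounded values quoted in the example; small differences in the last digit are to be expected from table interpolation.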

The identification criterion: $|Z_j| \le t$ for at least J - 1 values
of j, implies

$$P\left[\text{accept } H_0 \mid H_0 \text{ true}\right] = (1 - \alpha)^J + J\,\alpha\,(1 - \alpha)^{J-1} \qquad (8)$$

In general, the identification criterion $|Z_j| \le t$ for at least J - R
values of j, implies

$$P\left[\text{accept } H_0 \mid H_0 \text{ true}\right] = \sum_{r=0}^{R} \frac{J!}{(J-r)!\,r!}\,\alpha^r (1 - \alpha)^{J-r} \qquad (9)$$

$$A = 1 - \sum_{r=0}^{R} \frac{J!}{(J-r)!\,r!}\,\alpha^r (1 - \alpha)^{J-r} \qquad (10)$$

Note that 1 - A is the confidence level of this test for the null hypothesis:
X and Y represent the same oil.

The  statistical  model  elucidates  two  characteristic  properties  of 
spectral  fingerprinting.  First,  the  significance  of  a spectral  difference 
decreases  as  the  number  of  possible  differences  increases.  In  comparisons 
of  spectra  of  the  same  oil,  the  chances  for  a large  difference  to  occur  at 
random  are  greater  when  the  mass  range  spans  400  amu  than  where  there  are 
only  20  possible  places  for  differences  to  occur.  Second,  the  spectral 
difference  threshold,  required  for  a given  confidence  limit,  decreases  as 
the  number  of  allowable  differences  in  the  identification  criterion  increases. 
These  two  characteristics  are  demonstrated  in  Table  3. 

The threshold values in Table 3 were obtained in the following way:
(a) The confidence level 1 - A = 0.98 was chosen; (b) Equation (10) was
solved for $\alpha$ by successive approximations for the cases R = 0, 1, 2 and
J = 10, 20, 40; (c) A normal distribution with unit variance was assumed for
$Z_j$; (d) The threshold values for t were found by using
TABLE 3: Threshold values t as functions of J, the maximum number of
spectral differences possible, and R the maximum number of differences
acceptable for the identification test at the 98% confidence level. Errors
are assumed to be normally distributed and independent.

R     J = 10    J = 20    J = 40
0      3.09      3.29      3.48
1      2.28      2.54      2.78
2      1.87      2.18      2.45
$$1 - \alpha = \frac{1}{\sqrt{2\pi}} \int_{-t}^{t} \exp(-Z^2/2)\, dZ$$

or

$$\alpha = 2\,(1 - F(t))$$

where F(t) is the cumulative normal distribution, found in statistics tables.
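The procedure for Table 3 can be reproduced with a short sketch. Bisection is used here as one possible form of "successive approximations"; that choice, and all function names, are assumptions rather than a description of the original computation.

```python
from math import comb
from statistics import NormalDist


def rejection_risk(alpha, n_peaks, max_diffs):
    """Equation (10): probability that more than max_diffs differences exceed t."""
    accept = sum(
        comb(n_peaks, r) * alpha ** r * (1.0 - alpha) ** (n_peaks - r)
        for r in range(max_diffs + 1)
    )
    return 1.0 - accept


def solve_alpha(risk, n_peaks, max_diffs, iterations=200):
    """Find alpha with rejection_risk(alpha) = risk by bisection on [0, 1]."""
    lo, hi = 0.0, 1.0
    for _ in range(iterations):
        mid = 0.5 * (lo + hi)
        if rejection_risk(mid, n_peaks, max_diffs) < risk:
            lo = mid  # risk too small: alpha must be larger
        else:
            hi = mid
    return 0.5 * (lo + hi)


def threshold(risk, n_peaks, max_diffs):
    """Two-sided normal threshold t with alpha = 2 (1 - F(t))."""
    alpha = solve_alpha(risk, n_peaks, max_diffs)
    return NormalDist().inv_cdf(1.0 - alpha / 2.0)


# Thresholds at the 98% confidence level (A = 0.02), keyed by (J, R).
table = {(j, r): threshold(0.02, j, r) for j in (10, 20, 40) for r in (0, 1, 2)}
```

The bisection works because the rejection risk in Equation (10) increases monotonically with $\alpha$ for fixed J and R.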


The true spectrum of an oil is usually not known and the cost of
obtaining a good estimate of it by analyzing a large data sample is
prohibitive for routine analyses. The Student-t statistic can be used
to test the identity hypothesis in terms of means and variances
computed from small samples.

The sample mean vector representation of a set of N spectra is

$$\bar{X} = (\bar{x}_1, \bar{x}_2, \ldots, \bar{x}_J)$$

where

$$\bar{x}_j = \frac{1}{N} \sum_{i=1}^{N} x_{ji}, \quad j = 1, 2, \ldots, J$$

and $x_{ji}$ is the jth component of the ith spectrum.

The sample variance for the jth component is

$$S_j^2 = \frac{1}{N} \sum_{i=1}^{N} (x_{ji} - \bar{x}_j)^2$$
The statistic

$$Z_j = \frac{(\bar{x}_j - \bar{y}_j) - (\mu_j - \lambda_j)}{S_p\,(1/N_x + 1/N_y)^{1/2}} \qquad (14)$$

where $\bar{x}_j$ is a sample mean of size $N_x$ drawn from a normal population
with mean $\mu_j$, $\bar{y}_j$ is a sample mean of size $N_y$ drawn from a normal
population with mean $\lambda_j$, and $S_p^2$ is the pooled sample variance, is
distributed as the Student-t statistic with $N_x + N_y - 2$ degrees of freedom.
This statistic can be used to test the hypothesis: $\mu_j = \lambda_j$,
j = 1, 2, ..., J (i.e., the identity hypothesis) or to test the hypothesis:
$\mu_j \neq \lambda_j$, j = 1 or 2 or ... J.

In both cases, it is assumed that the $x_j$ and $y_j$ have common variances,
although the case of unequal variances, known as the Behrens-Fisher problem [8],
can also be analyzed. The Student-t statistic is used for finding the
threshold value t as a function of the sample sizes $N_x$ and $N_y$ and as a
function of the value of $\alpha$ that satisfies equation (10). In principle,
this statistic can be used for estimating the risk of rejecting $H_0$ when
$H_0$ is true (error of type 1) as well as the risk of accepting $H_0$ when
$H_0$ is false (error of type 2). By estimating the latter error, it is
possible to construct a "power function" which can be used for comparing tests
of identity. For example, threshold values can be found for identity tests that
involve different degrees of similarity but have the same confidence
level. A test that correctly accepts $H_0$ when all J spectral differences
are less than $t_0$ may have the same confidence level as a test that
correctly accepts $H_0$ when at least J - 1 spectral differences are less
than $t_1$. However, the two tests will, in general, differ in their
ability to reject $H_0$ when $H_0$ is false.
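A minimal sketch of the two-sample statistic of Equation (14) for a single mass number, evaluated under the identity hypothesis (equal population means), is given below. The pooled variance is computed in the usual way from the summed squared deviations of both samples divided by the degrees of freedom; that definition is assumed here because the text does not spell it out, and the function names are illustrative.

```python
import math


def mean(values):
    return sum(values) / len(values)


def pooled_t_statistic(x_samples, y_samples):
    """Student-t statistic for one mass number, assuming equal population means."""
    nx, ny = len(x_samples), len(y_samples)
    xbar, ybar = mean(x_samples), mean(y_samples)
    # Sums of squared deviations about each sample mean.
    ssx = sum((v - xbar) ** 2 for v in x_samples)
    ssy = sum((v - ybar) ** 2 for v in y_samples)
    # Assumed definition of the pooled variance with nx + ny - 2 degrees of freedom.
    pooled_var = (ssx + ssy) / (nx + ny - 2)
    return (xbar - ybar) / math.sqrt(pooled_var * (1.0 / nx + 1.0 / ny))


def degrees_of_freedom(x_samples, y_samples):
    return len(x_samples) + len(y_samples) - 2
```

The statistic would be computed once per mass number and each value compared with a Student-t threshold for the corresponding degrees of freedom.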

The statistical model for choosing threshold values relies on the
assumption that spectral differences are stochastically independent.
A multivariate test for this assumption can be found in the literature [9].
The independence hypothesis was tested on samples of field ionization
mass spectra ranging in size from 9 to 17. In general, spectral
differences at different mass numbers were not independent. In retrospect,
this result could have been anticipated. Oils are introduced into the
spectrometer's ionizer by means of a temperature controlled solid
insertion probe. The more volatile constituents of these complex
mixtures are mass analyzed at the beginning of the half hour data collection
period, during which the entire mass range is scanned approximately
1000 times and integrated into a single spectrum. Since volatility decreases
approximately with molecular weight, variations in operating parameters with
time produce errors that are correlated with molecular weight. It should be
emphasized that the correlations of these errors are crucial to the independence
hypothesis but not necessarily to the total error magnitudes. The efficiency
of the ionization source is known to change during a single mass analysis and
may be responsible for some of the observed systematic errors. Similar
problems can be anticipated in the analysis of spectral data obtained by
other analytical techniques. For example, the responsivities of gas and
liquid chromatography detectors are generally functions of temperature
and the flow rate of the carrier medium. The acquisition of a chromatogram
consisting of 20 or more peaks will usually require enough time for the
drifts in temperature and flow rate to be significant (i.e., to produce
correlation coefficients that are significantly different from zero). Thus
the use of statistical models, whose validity relies on the requirement of
independent errors, does not appear to be a practical solution to the
problem of analyzing real spectral data. Notwithstanding this conclusion,
the model described above elucidates the relation between the significance
of large spectral differences at a few mass numbers and small differences
at many mass numbers and forms a basis for the empirical approach
described in the next section.
Empirical Model


The empirical approach differs from the statistical model in the
way in which threshold values are chosen. The definition of an
identification criterion or "discriminant function" is essentially the
same in both models with the exception that, in the empirical case, threshold
values are usually not restricted to a single value (i.e., the usual
case is t = t(j)). Empirical techniques are described in the literature
under the headings "Pattern Recognition" and "Learning Machine Techniques"
[13,14]. As this literature has become extensive in recent years, we will limit
the following discussion to those techniques used in this project.

To define a discriminant function for determining whether or not
X and Y represent spectra of the same oil, we need to define threshold
values $t_j$ such that $|Z_j| \le t_j$, for at least J - R values of j, implies
identity and $|Z_j| > t_j$, for at least R + 1 values of j, implies that
X and Y represent different oils. This discriminant function differs
from the identification criterion used in the statistical model only in
the use of J threshold values $t_j$. Note, however, that the use of a
single threshold value in the statistical model was merely a mathematical
convenience and not a logical necessity. In the ideal empirical case
the $t_j$ are simply chosen so that $|Z_j| \le t_j$ for all pairs of vectors
X and Y that represent spectra of the same oil. As additional spectra are
acquired, the $t_j$ are adjusted so that the condition $|Z_j| \le t_j$ is
maintained for all vector pairs in the set. If, by applying this empirical
method to the sets of spectra representing many oils, every spectrum can be
unambiguously identified with its respective class (type of oil), we say that
the data are "separable." The assumption is made that if sufficiently large
sets of data are used (they are referred to as "training sets") the values
of $t_j$ will converge to their respective limits. At this point the vector
representation of an unknown spectrum can be classified correctly by
applying the discriminant functions iteratively to differences between
the unknown spectrum and spectra from each of the training sets.
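The training procedure can be sketched as follows. The sketch assumes that spectra are already normalized, takes the componentwise difference as the spectral difference, and sets each threshold to the largest difference observed between any two training spectra of the same oil. The rule that an unknown must agree with every training spectrum of a class, and all names and data, are illustrative assumptions rather than the project's actual program.

```python
from itertools import combinations


def train_thresholds(training_spectra):
    """Per-channel thresholds: largest |x_j - y_j| over all pairs in one class."""
    n = len(training_spectra[0])
    t = [0.0] * n
    for x, y in combinations(training_spectra, 2):
        for j in range(n):
            t[j] = max(t[j], abs(x[j] - y[j]))
    return t


def exceedances(unknown, reference, t):
    """Number of channels where the difference exceeds its threshold."""
    return sum(1 for u, r, tj in zip(unknown, reference, t) if abs(u - r) > tj)


def classify(unknown, training_sets, max_diffs=0):
    """Names of classes whose training spectra all agree with the unknown."""
    matches = []
    for name, spectra in training_sets.items():
        t = train_thresholds(spectra)
        if all(exceedances(unknown, s, t) <= max_diffs for s in spectra):
            matches.append(name)
    return matches


# Hypothetical three-channel training sets for two oils.
training_sets = {
    "oil_a": [[0.50, 0.50, 0.70], [0.52, 0.48, 0.70], [0.49, 0.51, 0.71]],
    "oil_b": [[0.80, 0.50, 0.30], [0.78, 0.52, 0.31]],
}
```

An empty result corresponds to an unknown that belongs to none of the training classes, and a result with more than one name corresponds to training sets that are not completely separable.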

Two  problems  that  can  arise  are:  (1)  The  unknown  spectrum  may  not 
be  a member  of  any  of  the  training  classes,  in  which  case  correct 
identification  is  impossible.  (2)  The  training  sets  may  not  be 
completely  separable,  in  which  case  a spectrum  will  sometimes  be 
identified  with  more  than  one  oil.  Making  the  wrong  choice  in  this 
case  is  equivalent  to  an  error  of  type  2 in  the  statistical  model. 

In addition to the above problems, the identification of oils for
forensic purposes often involves comparisons between weathered and
unweathered specimens of the same oil. For an oil whose spectrum falls
outside the set of training classes, such as a weathered oil, the degree
of similarity enables us to specify the class that most resembles the
unknown. Furthermore, in cases of ambiguous identity, it facilitates a
choice between candidate classes. Therefore, the use of the degree of
similarity, defined in the statistical model, is a desirable adjunct to the
empirical classification technique.

Reduced Dimensionality

Well resolved spectra contain copious quantities of information.
For example, the number of distinctly different mass spectra that are
possible with unit mass resolution over a range of 100 amu and with
peaks that can be resolved into 10 significant magnitudes is $10^{100}$
(ten possible magnitudes at each of 100 mass positions).

State  of  the  art  field  ionization  mass  spectrometry  is  capable  of 
producing  molecular  ion  profiles  of  complex  mixtures,  such  as  crude 
and  refined  oils,  that  span  more  than  400  amu.  Examples  of  wide  mass 

range  molecular  ion  spectra  are  shown  in  Figures  9,  10,  and  11. 

The  computer  analysis  of  400-dimensional  vectors  is  expensive  and 
unnecessary.  The  envelope  of  a spectrum  can  be  digitized  by  using  a 
technique  that  imitates  analog  to  digital  conversion.  The  envelope  of 
a 400-peak  spectrum  can  thus  be  transformed  into  a histogram  whose 
spectral range (abscissa) is partitioned into, say, 10 intervals. In
other  words,  a 400-component  spectral  representation  can  be  transformed 
into  an  arbitrarily  smaller  dimensionality.  Moreover,  a unique 
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advantage  of  field  ionization  mass  spectrometry  is  that  the  envelopes  of 
molecular  ion  distributions  of  crude  oils  attain  their  maximum  values  at 
high  mass  numbers,  while  those  of  refined  oils  peak  at  relatively  low  mass 
numbers.  In  fact,  it  is  usually  possible  to  identify  oils  as  crudes  or 
refined  products  by  visual  inspection  of  their  spectrograms,  as  illustrated 
by  Figures  5 and  6.  Therefore,  in  the  case  of  field  ionization  spectra, 
the cumulative spectral distribution appears to be useful as a means of
describing  the  gross  characteristics  of  oil  spectra.  For  example,  the 
fifty  percentile  point  is  equal  to  the  mass  number  at  which  the  spectrum 
can  be  divided  into  two  equal  areas.  The  fifty  percentile  points  for  refined 
oils  will,  in  general,  be  smaller  than  those  of  crude  oils.  Thus  the  use  of 
the  digitized  cumulative  spectral  distribution  provides  a means  of 
categorizing  oil  spectra  as  well  as  reducing  the  dimensionality  of  the 
computer  analysis. 
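
The fifty percentile point described above can be sketched in a few lines
of Python. This is only an illustration under stated assumptions: the
spectrum is taken as a list of peak heights at increasing mass numbers,
the cumulative distribution is interpolated linearly between neighbouring
masses, and the function names and the two toy spectra are hypothetical,
not data from this program.

```python
from bisect import bisect_left


def cumulative_percent(heights):
    """Running sum of peak heights, scaled so that the last value is 100."""
    total = float(sum(heights))
    running = 0.0
    cumulative = []
    for h in heights:
        running += h
        cumulative.append(100.0 * running / total)
    return cumulative


def percentile_point(masses, heights, p):
    """Mass number at which the cumulative distribution reaches p percent.

    Assumes masses are in increasing order and 0 < p <= 100; interpolates
    linearly between the two samples that bracket p.
    """
    cum = cumulative_percent(heights)
    i = min(bisect_left(cum, p), len(cum) - 1)  # first sample with cum[i] >= p
    if i == 0:
        return float(masses[0])
    frac = (p - cum[i - 1]) / (cum[i] - cum[i - 1])
    return masses[i - 1] + frac * (masses[i] - masses[i - 1])


# Hypothetical envelopes: one weighted toward low mass, one toward high mass.
masses = [100, 200, 300, 400]
refined_like = [4, 3, 2, 1]
crude_like = [1, 2, 3, 4]
m50_refined = percentile_point(masses, refined_like, 50)
m50_crude = percentile_point(masses, crude_like, 50)
print(m50_refined, m50_crude)
```

In this toy case the envelope weighted toward low mass has the smaller
fifty percentile point, which is the behaviour the text attributes to
refined oils relative to crudes.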

The  use  of  the  digitized  cumulative  distribution  of  a spectrum  rather 
than  its  constituent  peaks  precludes  many  of  the  problems  associated  with 
spectrometer  resolution  and  the  deconvolution  of  fused  peaks.  To  a 
large  extent,  it  provides  a common  denominator  for  sets  of  spectra 
whose  quality  (resolution  in  particular)  is  variable.  The  obvious 
price  that  is  paid  for  the  advantages  of  this  technique  is  a loss  of 
spectral  information.  Notwithstanding  this  limitation,  we  were  able  to 
classify  154  spectra  into  35  different  categories  with  95%  success. 
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The  use  of  digitized  cumulative  spectral  distributions  was  conceived 


as  a means  of  programming  a computer  to  separate  the  spectra  of  crudes 
from  those  of  refined  products,  thus  avoiding  the  cost  of  unnecessary 
detailed  spectral  comparisons.  A computer  can  be  programmed,  for  example, 
to  avoid  identity  tests  between  spectra  whose  respective  fifty  percentile 
points  are  separated  by  more  than  20  amu.  Furthermore,  the  resolution  of 
our  data,  over  long  periods  of  time,  has  been  inconsistent  due  to  the 
fact  that  considerable  "hardware"  development  has  occurred  during  the  course 
of  this  project,  and  techniques  based  on  the  use  of  cumulative  distributions 
appeared  to  have  the  potentiality  for  enabling  meaningful  comparisons  to 
be  made  between  "good"  and  "bad"  data. 
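
The pre-screening idea in the paragraph above (skip identity tests when
the fifty percentile points differ by more than 20 amu) might look like
the following sketch. The dictionary of fifty percentile points, the
spectrum names, and the function name are hypothetical placeholders; only
the 20 amu gap comes from the text.

```python
from itertools import combinations


def pairs_worth_testing(m50_by_spectrum, max_gap_amu=20.0):
    """Return the pairs of spectra whose fifty percentile points lie within
    max_gap_amu of each other; all other pairs are skipped."""
    pairs = []
    for a, b in combinations(sorted(m50_by_spectrum), 2):
        if abs(m50_by_spectrum[a] - m50_by_spectrum[b]) <= max_gap_amu:
            pairs.append((a, b))
    return pairs


# Hypothetical fifty percentile points, in amu.
m50 = {"spec_a": 250.0, "spec_b": 400.0, "spec_c": 262.0}
candidates = pairs_worth_testing(m50)
print(candidates)
```

Only the surviving pairs would then go on to the detailed comparison, so
the cost of the screen is one subtraction per pair.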

The  following  method  for  data  analysis  was  devised  as  an  inexpensive 
and  mathematically  simple  approach  to  the  problem  of  identifying  complex 
mixtures  by  the  use  of  cumulative  spectral  distributions. 

Oil  Identification  by  the  Use  of  "No-Resolution  Mass  Spectrometry" 

Consider a smooth and continuous spectrogram of the form y = y(m),
where y is the height of the chart recording and m is a continuous variable
in units of fractional amu. The percent cumulative spectral distribution
can be defined as

$$F(m_i) = \frac{100 \int_{m_{\min}}^{m_i} y \, dm}{\int_{m_{\min}}^{m_{\max}} y \, dm} \qquad (15)$$

where $m_{\min}$ and $m_{\max}$ are the lower and upper limits
respectively of the mass range. Figure 12 shows the percent cumulative
spectral distributions of 34 spectra comprising the representations of
three different oils. The three distinct families of curves in Figure 12
are evidence of "clustering" or the separation of data into classes.
Examples of spectra from these three classes are shown in Figures 13, 14
and 15.


FIGURE 13  GULF NO. 6 FUEL OIL, CINCINNATI REFINERY (abscissa: mass number, amu)

SHELL NO. 6 FUEL OIL (figure)

Define $M_p$ by $F(M_p) = p\%$ (e.g., if $m_i = M_p$ then
$F(m_i) = p\%$). The following vector can be constructed

$$X = (x_{10}, x_{20}, \ldots, x_{80}) \qquad (16)$$

where $x_j = M_{j+10} - M_j$ = the increment in mass range that is
traversed as the cumulative distribution passes from the $j$th percentile
point to the $(j + 10)$th percentile point. The choice of dimensionality
for this vector was arbitrary; the omission of the components $x_0$ and
$x_{90}$ was not. The components of this vector are crude derivatives of
the form $\Delta m / \Delta y$. For spectrograms that have no peaks at the
upper or lower limits of the spectrometer's sweep range, a shift of the
mass scale relative to the spectrum has no effect on the vector
representation of equation (16) (i.e., the mass scale can be shifted until
one of the mass range extremities intercepts a peak). In other words,
there is no need for assigning mass numbers to individual peaks. The
inclusion of $x_0$ and $x_{90}$ in equation (16) requires that the data
analysis span precisely the same mass range for each spectrum. In view of
the additional calibration computations required, the analysis is
simplified by omitting $x_0$ and $x_{90}$. Note also that since the sum of
the $x_j$'s equals the mass range (a constant), at least one of them
should be eliminated as it is linearly dependent on the rest.
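
A minimal sketch of the vector of equation (16) follows. It assumes a
spectrum given as peak heights at increasing mass numbers and linear
interpolation of the cumulative distribution; the helper names and the
flat toy spectrum are hypothetical and are used only to show that a shift
of the mass scale leaves the eight components unchanged.

```python
from bisect import bisect_left


def percentile_point(masses, heights, p):
    """Mass at which the percent cumulative distribution reaches p."""
    total = float(sum(heights))
    running, cum = 0.0, []
    for h in heights:
        running += h
        cum.append(100.0 * running / total)
    i = min(bisect_left(cum, p), len(cum) - 1)
    if i == 0:
        return float(masses[0])
    frac = (p - cum[i - 1]) / (cum[i] - cum[i - 1])
    return masses[i - 1] + frac * (masses[i] - masses[i - 1])


def spectral_vector(masses, heights):
    """Eight components x_10 ... x_80, where x_j = M_(j+10) - M_j."""
    m = {p: percentile_point(masses, heights, p) for p in range(10, 100, 10)}
    return [m[j + 10] - m[j] for j in range(10, 90, 10)]


# Hypothetical flat spectrum: 100 equal peaks at 101 ... 200 amu.
masses = list(range(101, 201))
heights = [1] * 100
x = spectral_vector(masses, heights)
x_shifted = spectral_vector([m + 7 for m in masses], heights)
print(x)
```

Because every component is a difference of two percentile points, adding
a constant to all mass numbers cancels out, which is why no absolute mass
calibration of individual peaks is needed in this representation.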


The following procedure can be used to define a discriminant function
for identifying spectra of oil A.

1.  Obtain a set of spectra of oil A.

2.  Construct vectors of the form of equation (16) for each of the
spectra.

3.  The range for the $j$th vector component in the training set is
$D_j = x\max_j - x\min_j$, and the mid-range is
$\bar{x}_j = (x\max_j + x\min_j)/2$, where $x\max_j$ and $x\min_j$ are
the largest and the smallest $j$th components of the training set
respectively. To test an unknown vector $Y$ with components $y_j$ for
membership in class A, we can use the discriminant function defined by

$$|Z_j| \le t_j \quad \text{for all } J = 8 \text{ values of } j \text{ simultaneously,} \qquad (17)$$

where

$$Z_j = y_j - \bar{x}_j$$

$$t_j = D_j / 2$$

If $|Z_j| \le t_j$ for $J - R$ values of $j$, we can say that the oil
represented by $\bar{X}$ is an $(R + 1)$th choice for the identity of $Y$.
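
The three steps above can be sketched as follows. This is an illustrative
reading of equation (17), not the program used for this report: the
function names, the small tolerance `eps` (added so that training members
lying exactly on the edge of the range are not rejected by floating-point
rounding), and the toy training set are all assumptions.

```python
def train_class(vectors):
    """Mid-range and half-range (t_j = D_j / 2) of each vector component."""
    columns = list(zip(*vectors))
    mid = [(max(c) + min(c)) / 2.0 for c in columns]
    half = [(max(c) - min(c)) / 2.0 for c in columns]
    return mid, half


def choice_rank(y, mid, half, eps=1e-9):
    """Return R + 1, where R is the number of components of y that fall
    outside the training range; 1 means every component matched."""
    misses = sum(abs(yj - mj) > tj + eps for yj, mj, tj in zip(y, mid, half))
    return misses + 1


# Hypothetical training set for one oil: three 8-component vectors.
training = [[10.0] * 8, [12.0] * 8, [14.0] * 8]
mid, half = train_class(training)

inside = [11.0] * 8             # all eight components within range
one_off = [11.0] * 7 + [20.0]   # last component outside the range
print(choice_rank(inside, mid, half), choice_rank(one_off, mid, half))
```

With this definition every training vector receives rank 1 for its own
class, which corresponds to the statement that the discriminant function
classifies all members of the training set correctly.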


This discriminant function ensures the correct classification of
all members of the training set. It does not exclude the possibility of
identifying a spectrum from a training set with more than one oil,
however. The frequency with which training set spectra are classified
ambiguously provides a measure for evaluating the merits of this
discriminant function before applying it to spectra of unknowns. The
identification criterion, equation (17), is equivalent to a partitioning
of the vector space into 8-dimensional parallelepipeds with centers at
$\bar{X}_1 \ldots \bar{X}_N$, where N is the number of classes in the
training set. The success of this technique relies on the analysis of
training sets that are large enough to provide good estimates of the
dispersion of the data, measured in terms of the range, and of the true
vector representation, measured in terms of the mid-ranges of each
component. From a practical point of view, this technique appears at
first to offer no advantage over statistical models, in which the
population means and variances must be estimated. Both models require
large sample sizes (i.e., $M \ge 10$). However, the first objective of
mass spectrometric fingerprinting is to identify oils and the empirical
approach provides a means to meet this objective without the constraints
of a specific model for the data.

IV EXPERIMENTAL RESULTS

Quadrupole  Data 

Table 4 summarizes the results of applying the procedure outlined
above to 154 field ionization mass spectra comprising 35 different classes
ranging in size from 2 to 17. A quadrupole mass separator was used for
analyzing specimens prepared by vacuum distillation. The class number, class
code name and class size are tabulated in the first three columns respectively.

The use of Table 4 can be described by a few examples: The zero in the
"False ID" column of the first class means that none of the 17 spectra of
Shl-6 oil was identified with a class other than its own. The oil whose
molecular ion distribution is more similar to that of Shl-6 than any of the
other oils is G-6,Phil. The "4th Choice" column of the K = 1 row shows
that G-6,Phil was the fourth choice for the identity of seven of the
Shl-6 spectra. In the case of the nine Zel-Cr spectra (K = 2) the fourth
choice was TobL-Cr for three of them and one class each for three others.
The average dispersion of each class of spectral representations is tabulated
in the "R%" column where the entries are equal to $D_j / \bar{x}_j$ averaged
over the eight values of j. Most of the classes are smaller in size than 7, in
which cases the estimates of range are unreliable. However, considering
that the 8-dimensional vectors represent only the gross characteristics
of an oil spectrum it is remarkable that ambiguities resulted in only 5%
of the spectra and that all of these were falsely identified with the
same oil (viz. K = 2).

TABLE 4

K   Code Name  Size  False ID   2nd Choice          3rd Choice     4th Choice                            R%
1   Shl-6      17    0          0                   0              7(G-6,Phil)                           39.4
2   Zel-Cr     9     0          0                   0              3(TobL-Cr); Sts-Cr; SBD1-Cr; Apa-Cr   32.5
3   G-6,Cin    10    0          0                   0              0                                     60.1
4   QQ-Cr      10    0          7(AmoD1-4)          3(AmoD1-4)     4(ShoD4-6)                            11.8
5   G-6,Phil   13    0          0                   0              2(Shl-6)                              31.1
6   Bol-Cr     7     5(Zel-Cr)  2(Zel-Cr); SBD1-Cr  0              3(SBD1-Cr)                            9.4
7   G-6,S.Fe   3     0          AmoD1-4             AmoD1-4        0                                     6.3
8   Sts-Cr     3     0          Zel-Cr              Zel-Cr         Zel-Cr                                6.8
9   Say-6      3     0          0                   2(Zel-Cr)      Zel-Cr                                3.6
10  Apa-Cr     3     Zel-Cr     Zel-Cr              0              Zel-Cr                                15.7
11  G-6,P.A.   3     0          0                   3(Zel-Cr)      Bol-Cr                                3.3
12  Ira-Cr     3     0          AmoD1-4             2(AmoD1-4)     Zel-Cr                                6.5
13  Tob-Cr     3     0          3(Zel-Cr)           Bol-Cr         SBD4-Cr                               3.3
14  TobL-Cr    4     0          2(Zel-Cr)           Zel-Cr         Zel-Cr                                8.7
15  Has-Cr     3     0          0                   3(Zel-Cr)      0                                     5.2
16  Say-2      3     0          0                   0              AmoD1-4                               16.4
17  Say-Lub    2     0          0                   0              0                                     1.4
18  SB-Cr      3     Zel-Cr     2(Zel-Cr)           Bol-Cr         2(Bol-Cr); Apa-Cr                     4.3
19  SBD1-Cr    3     Zel-Cr     2(Zel-Cr)           Bol-Cr         Bol-Cr                                5.3
20  SBD4-Cr    3     0          2(Zel-Cr)           Zel-Cr         Bol-Cr; TobL-Cr                       5.9
21  Sh17       3     0          0                   Sh17D1         Sh17D1                                13.0
22  Amo-4      3     0          0                   0              0                                     9.5
23  Sh17D1     3     0          Sh17                Sh17; Sh17D4   0                                     6.4
24  Sh17D4     3     0          0                   [illegible]    Sh17D1                                19.8
25  AmoD1-4    4     0          0                   0              Shl-6                                 52.7
26  AmoD4-4    3     0          0                   0              AmoD1-4                               6.9
27  Sh33-6     3     0          2(Shl-6)            Shl-6          G-6,Phil; Sh33D1-6                    9.8
28  Sh33D1-6   3     0          0                   3(Shl-6)       0                                     6.8
29  Sh33D4-6   3     0          0                   Shl-6          2(Shl-6)                              4.1
30  Sho-6      3     0          0                   0              0                                     4.5
31  ShoD1-6    3     0          3(Zel-Cr)           Apa-Cr         Apa-Cr                                8.4
32  ShoD4-6    3     0          0                   ExoD4-2        AmoD1-4                               14.3
33  Exo-2      3     0          3(AmoD1-4)          AmoD1-4        0                                     4.9
34  ExoD1-2    3     0          2(AmoD1-4)          AmoD1-4        0                                     6.2
35  ExoD4-2    3     0          0                   2(ShoD4-6)     2(AmoD1-4); QQ-Cr                     7.8

K = class number
R = average spectral range in percent
Oil codes: Cr = crude, -N = number N refined oil, D1, D4 = weathered for 1 and 4 days
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In  an  effort  to  devise  an  oil  identification  scheme  that  can  provide 


a measure of similarity without the requirement of ten or more spectra
per training set specimen, the empirical definition for the threshold
values was modified. Consider a set of vectors of the form described in
equation (16), representing the cumulative spectral distribution of a set
of oil specimens. From this set we pick a reference vector $X_r$. A
threshold value for the $j$th component of $X_r$ can be defined as
$t_{rj} = f x_{rj}$, $j = 1, 2, \ldots, J$, where $0 < f < 1$. In this
case we will identify a vector $X_i$ with the reference oil whose
representation is $X_r$ when $|x_{ij} - x_{rj}| \le f x_{rj}$
for at least $J - R$ values of $j$. This discriminant function provides two
degrees of freedom, the degree of similarity, $J - R$, and the threshold
factor, $f$. In this case, where we are trying to avoid the
large numbers of spectra required in both the statistics and the learning
machine models, we expect to determine both degrees of freedom empirically
after analyzing sufficiently large sets of data in which the number of
spectra per specimen is generally less than three or four. The requirement
for complete similarity (i.e., the requirement R = 0) has been relaxed
to avoid the large sample sizes needed for estimating the ranges of data.
Since there is no way to estimate appropriate threshold values a priori
because the data are not independent in the statistical sense, we will
determine these values empirically in terms of how well they separate
"known similars" from "known dissimilars."
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To  apply  this  analysis  all  the  possible  pairs  in  a set  of  vectors 


are analyzed with one member of the pair acting as reference. The analysis
is performed iteratively as the parameter f is varied systematically.
The results are examined for the best combinations of J - R and f.
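
The pairwise scoring just described can be sketched as below. The score
counts the components of a vector that fall within a fraction f of the
corresponding reference component, and the scan over f is a plain loop.
The function names and the toy vectors are hypothetical; the report's own
program is not reproduced here.

```python
def similarity_score(x, ref, f):
    """Number of components of x within f * ref_j of the reference."""
    return sum(abs(xj - rj) <= f * rj for xj, rj in zip(x, ref))


def score_table(vectors, ref_name, factors):
    """Scores of every other vector against one reference, for each f."""
    ref = vectors[ref_name]
    return {
        f: {name: similarity_score(x, ref, f)
            for name, x in vectors.items() if name != ref_name}
        for f in factors
    }


# Hypothetical 8-component vectors.
vectors = {
    "ref": [10.0] * 8,
    "near": [10.5] * 8,
    "mixed": [10.5] * 4 + [13.0] * 4,
}
table = score_table(vectors, "ref", [0.10, 0.40])
print(table)
```

Note that the threshold is scaled by the reference component, so the
score of A against B need not equal the score of B against A; this is one
reason every vector is tried as the reference in turn.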

The individual quadrupole spectra comprising the 35 classes listed
in Table 4 are identified with their respective class numbers and oil code
names in Table 5. Table 6 shows the results of using the set of 154
quadrupole spectra in a similarity measurement test. One vector from
each of the six largest classes was used as a reference and compared to
the other 153 8-dimensional representations. For each threshold factor
f the numbers of correct and incorrect identifications are tabulated as
functions of the degree of similarity J - R. For example, using spectrum
no. 21 as reference, the remaining 9 members of class 3 were correctly
identified while 8% of the 143 spectra from other classes were incorrectly
identified when a similarity index of J - R = 2 was used with a
factor f = 0.16.

TABLE 5  Class Identification and Spectrum Numbers of 154
Quadrupole Spectra

N    K   NAME        N    K   NAME        N    K   NAME
1    1   SHL-6       30   6   BOL-CRU     58   1   SHL-6
2    1   "           31   6   "           59   1   "
3    1   "           32   6   "           60   1   "
4    1   "           33   7   G-6,S FE    61   12  IRA-CRU
5    1   "           34   7   "           62   12  "
6    1   "           35   7   "           63   12  "
7    1   "           36   8   Sts-CRU     64   13  TOB-CRU
8    1   "           37   8   "           65   13  "
9    1   "           38   8   "           66   13  "
10   1   "           39   9   Say-6       67   35  TOBL-CRU
11   1   "           40   9   "           68   35  "
12   1   "           41   9   "           69   35  "
13   2   ZEL-CRU     42   10  Apa-CRU     70   35  "
14   2   "           43   10  "           71   14  HAS-CRU
15   2   "           44   10  "           72   14  "
16   2   "           45   11  G-6,PA      73   14  "
17   2   "           46   11  "           76   4   QQ-CRU
18   2   "           47   11  "           77   4   "
20   1   SHL-6       48   3   G-6,CIN     78   4   "
21   3   G-6,CIN     49   3   "           84   3   G-6,CIN
22   3   "           50   3   "           85   5   G-6,PHIL
23   3   "           51   5   G-6,PHIL    86   3   G-6,CIN
24   4   QQ-CRU      52   5   "           87   4   QQ-CRU
25   4   "           53   5   "           88   4   "
26   4   "           54   2   ZEL-CRU     89   4   "
27   5   G-6,PHIL    55   2   "           90   4   "
28   5   "           56   2   "           91   3   G-6,CIN
29   5   "           57   1   SHL-6       92   3   "
93   6   BOL-CRU     132  22  SH17D1      170  31  SHOD4-6
94   6   "           133  22  "           171  32  EXO-2
95   6   "           134  22  "           173  32  "
96   6   "           135  23  SH17D4      174  32  "
98   15  Say-2       137  23  "           175  33  EXOD1-2
99   15  "           138  23  "           176  33  "
100  15  "           139  24  AMOD1-4     178  33  "
101  16  Say-LUB     140  24  "           179  34  EXOD4-2
102  16  "           141  24  "           180  34  "
105  5   G-6,Phil    142  24  "           181  34  "
106  5   "           144  25  AMOD4-4
107  5   "           145  25  "
108  5   "           146  25  "
109  5   "           147  26  SH33-6
110  5   "           149  26  "
113  17  SB-CRU      150  26  "
114  17  "           151  27  SH33D1-6
115  17  "           152  27  "
116  18  SBD1-CRU    154  27  "
118  18  "           156  28  SH33D4-6
119  18  "           157  28  "
120  19  SBD4-CRU    158  28  "
121  19  "           160  29  SHO-6
123  19  "           161  29  "
124  20  SH17        162  29  "
126  20  "           164  30  SHOD1-6
127  20  "           165  30  "
128  21  AMO-4       166  30  "
129  21  "           168  31  SHOD4-6
131  21  "           169  31  "

N = spectrum number
K = class number

TABLE 6  Percentages of Correct (Left Entry) and Incorrect (Right Entry)
Classifications of Quadrupole Spectra of 6 Different Oils.
Total Sample Size = 154.

REF  f      J-R = 8   7         6         5         4         3         2
1    0.10   8, 0      12, 0     25, 0     25, 2     25, 5     44, 6     69, 8
     0.12   12, 0     19, 0     25, 1     25, 5     25, 5     69, 6     69, 9
     0.14   12, 0     25, 0     25, 4     25, 5     44, 6     81, 6     100, 10
     0.16   19, 0     25, 1     25, 4     31, 6     56, 7     87, 7     100, 13
13   0.10   12, 1     25, 2     37, 8     37, 14    37, 24    37, 33    50, 51
     0.12   12, 1     25, 3     37, 10    37, 18    37, 30    50, 40    75, 57
     0.14   12, 3     25, 6     37, 12    37, 23    37, 36    62, 49    87, 61
     0.16   25, 7     37, 13    37, 20    37, 29    50, 39    62, 55    100, 66
21   0.10   0, 0      11, 0     11, 0     11, 0     22, 0     56, 0     78, 2
     0.12   0, 0      11, 0     11, 0     11, 0     22, 0     67, 0     78, 3
     0.14   0, 0      11, 0     11, 0     22, 0     33, 0     78, 1     78, 6
     0.16   0, 0      11, 0     11, 0     22, 0     44, 0     89, 1     100, 8
24   0.10   22, 0     78, 0     100, 1    100, 3    100, 17   100, 37   100, 55
     0.12   22, 0     100, 1    100, 4    100, 8    100, 26   100, 49   100, 62
     0.14   89, 0     100, 1    100, 6    100, 15   100, 38   100, 52   100, 66
     0.16   89, 0     100, 2    100, 8    100, 19   100, 43   100, 53   100, 69
27   0.10   0, 0      0, 0      33, 0     50, 0     75, 1     100, 4    100, 11
     0.12   0, 0      17, 0     33, 0     58, 0     83, 4     100, 8    100, 20
     0.14   17, 0     25, 0     50, 1     75, 4     83, 7     100, 11   100, 23
     0.16   17, 0     33, 0     58, 1     83, 6     92, 7     100, 15   100, 25
30   0.10   33, 3     83, 5     100, 10   100, 17   100, 28   100, 43   100, 54
     0.12   33, 3     100, 5    100, 14   100, 22   100, 35   100, 48   100, 59
     0.14   33, 4     100, 7    100, 18   100, 27   100, 36   100, 50   100, 59
     0.16   50, 6     100, 13   100, 23   100, 29   100, 39   100, 54   100, 64

Ref. No.  Class  Size  Name
1         1      17    Shell No. 6
13        2      9     Zelten Crude
21        3      10    Gulf No. 6 Cincin.
24        4      10    Quirequire Crude
27        5      13    Gulf No. 6 Phila.
30        6      7     Bolivar Crude


In general, Table 6 shows that for five of the six classes there are
pairs of parameters that enable all of the members of a specified class
to be identified correctly while only 10% or less of the remaining spectra
are identified incorrectly. However, to avoid using large training sets
for each class it is necessary to demonstrate the existence of a single
pair of parameters that can be used successfully for the identification of
most of the classes.

The entries in Table 7 are the result of averaging the respective
entries in Table 6 over the six classes. Only one pair of parameters,
J - R = 2, f = 0.16, produces an identification criterion that results in
all members of the six classes being classified correctly. Unfortunately,
an average of 41% of the remaining spectra are also identified with the
same six classes. From a forensic point of view this is equivalent to
arresting all of the culprits and 41% of the bystanders. If an average
of 10% incorrect identification is arbitrarily taken as acceptable,
Table 7 shows that approximately half of the valid identifications will
go undetected (viz. the entry at J - R = 5, f = 0.12). It appears, therefore,
that on the basis of the 154 quadrupole spectra we must conclude that
either large numbers of spectra per oil are necessary or that it is
necessary to characterize spectra with more detail than is possible with
an 8-dimensional vector representation of the cumulative spectral distribution.
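
As a check on the averaging step, the short sketch below recomputes two
of the averaged entries from the per-class percentages of Table 6
(references 1, 13, 21, 24, 27 and 30, in that order). The lists are as
read from Table 6; the function name and the rounding to the nearest
whole percent are assumptions.

```python
def average_percent(values):
    """Mean of per-class percentages, rounded to the nearest whole percent."""
    return round(sum(values) / len(values))


# J - R = 2, f = 0.16: every class fully recovered, many false matches.
correct_f016_s2 = [100, 100, 100, 100, 100, 100]
incorrect_f016_s2 = [13, 66, 8, 69, 25, 64]

# J - R = 5, f = 0.12: false matches under 10%, about half recovered.
correct_f012_s5 = [25, 37, 11, 100, 58, 100]
incorrect_f012_s5 = [5, 18, 0, 8, 0, 22]

print(average_percent(correct_f016_s2), average_percent(incorrect_f016_s2))
print(average_percent(correct_f012_s5), average_percent(incorrect_f012_s5))
```

The first pair reproduces the "all of the culprits and 41% of the
bystanders" case; the second pair is the entry in which roughly half of
the valid identifications are missed at about 9% false identification.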
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TABLE  7 Percentages  of  Correct  (Left  Entry)  and 
Incorrect (Right Entry) Classifications
Averaged Over the 6 Classes of Table 6

f      J-R = 8   7         6         5         4         3         2
0.10   12, 1     35, 1     51, 3     54, 6     60, 12    73, 20    83, 30
0.12   13, 1     41, 2     51, 5     55, 9     61, 17    81, 25    87, 35
0.14   27, 1     48, 2     54, 7     60, 12    66, 20    87, 28    94, 37
0.16   33, 2     51, 5     55, 9     62, 15    74, 23    90, 31    100, 41


60° Sector Magnet Data

Twenty-nine  oil  specimens,  representing  six  oil  spill  cases,  were 
supplied  to  us  by  the  U.S.  Coast  Guard.  Two  of  these  specimens  were  mass 
analyzed  in  duplicate,  one  was  analyzed  thirteen  times.  The  digitized  data 
for  the  specimen  with  code  name  RDC  17  was  destroyed  by  an  accident  in 
the laboratory and one of the thirteen replicate spectra (tag #31) for
the oil with code name RDC 26 was inadvertently omitted from some of the
computer analyses. Table 8 shows the Coast Guard's RDC code numbers, the
tag numbers used by our computer for spectrum identification and the case
numbers  for  the  42  spectra  comprising  this  data  set. 

The  thirteen  replicate  spectra  are  shown  in  Figure  16.  These 
measurements  were  performed  before  the  completion  of  some  of  the  electronic 
circuitry  for  the  60  degree  sector  magnet  spectrometer  and  with  a prototype 
field  ionization  source  that  enabled  us  to  obtain  spectra  without  the  usual 
specimen  preparation  by  vacuum  distillation.  Due  to  the  developmental  stage 
of  the  spectrometer,  we  obtained  the  following  results:  there  are 
nonlinearities  in  the  mass  scale  and  there  are  systematic  errors  that  can  be 
correlated  with  mass  number  or  with  volatility.  The  latter  problem  is  a 
consequence  of  using  manual  control  on  the  solid  insertion  probe  heater  in 
which  the  oil  was  placed  so  that  the  sample  feed  programming  was  not  constant 
over  the  thirteen  analyses.  The  systematic  error  is  manifested  by  the 
variability  in  the  envelopes  of  the  spectra;  some  are  fairly  flat,  others 
tend to peak at 400 amu.

TABLE 8  Identification Code (RDC) Numbers and Spectrum Tag Numbers for
42 Sector Magnet Spectra Comprising 28 Oil Specimens

CASE #   RDC #   TAG #
43       1       34
         2       15
         3       39, 44
         4       19
         5       4, 12
66       6       33
         7       36
         8       38
         9       26
         10      23
         11      13
68       12      11
         13      32
         14      20
         15      22
         16      42
77       17      -
         18      46
         19      6
         20      24
         21      7
         22      43
         23      47
80       24      30
         25      16
82       26      9, 14, 17, 21, 25, 28, 31, 35, 41, 45, 48, 49, 50
         27      37
         28      27
         29      10

FIGURE 16

The number of peaks detected by computer algorithm
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varied  from  117  to  172  peaks  per  spectrum  with  an  average  of  145.  The 


number of ions counted per peak for the subset of peaks used in this report
was of the order of $10^2$. Therefore, an average relative standard deviation
of the order of 10% would be expected from ion counting statistics alone.
The actual standard deviation, computed after each spectrum was normalized
to unit total area, was 16%, averaged over the 38 mass numbers selected
for reliability (i.e., all 13 spectra had detectable peaks at these mass
numbers).
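
The link between ion counts and the expected 10% figure can be made
explicit with a small sketch. It assumes Poisson counting statistics, for
which the relative standard deviation of a count N is 1/sqrt(N); that
model and the function names are assumptions, since the text says only
"ion counting statistics."

```python
import math


def relative_sd_from_counts(n_ions):
    """Relative standard deviation of a Poisson count of n_ions."""
    return 1.0 / math.sqrt(n_ions)


def counts_needed(relative_sd):
    """Ion count per peak needed to reach a given relative standard deviation."""
    return 1.0 / relative_sd ** 2


print(relative_sd_from_counts(100))   # about 100 ions per peak
print(counts_needed(0.10))            # count implied by a 10% figure
```

Under this model, halving the counting contribution to the scatter would
require about four times as many ions per peak.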

Spectra of the three other oil specimens in case 82 are shown in
Figure 17. A comparison of Figures 16 and 17 shows a strong similarity
between the spectra of oils with RDC numbers 26, 28 and 29, whereas the
spectrum of oil number 27 is obviously different from the others.

Figure 18 shows cumulative distributions for the 16 spectra comprising
the field ionization mass spectral representations of the 4 specimens in
case 82. The similarities and the dissimilarity evident in the comparison
of Figures 16 and 17 are also displayed in Figure 18. The cumulative
distributions of the spectra obtained for the five other cases are shown
in Figures 19 through 23. With the exception of case 43, shown in Figure 19,
the cumulative distributions indicate a separation of the oils in each case
into two or more classes.

While a cumulative distribution contains all the information inherent
in the original spectrum from which it was derived, its appearance
emphasizes the gross characteristics of the molecular ion mass distribution.

FIGURE 18  (abscissa: mass number, amu)

FIGURE 19  CUMULATIVE DISTRIBUTIONS OF 7 SPECTRA COMPRISING 5 OIL
SPECIMENS COMPRISING CASE 43 (abscissa: mass number, amu; ordinate: percent)

FIGURE 22  CUMULATIVE DISTRIBUTIONS FOR 6 OIL SPECIMENS COMPRISING CASE 77

FIGURE 23  CUMULATIVE DISTRIBUTIONS FOR 2 OILS COMPRISING CASE 80
(abscissa: mass number, amu)

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The  two  distributions  in  Figure  23,  for  example,  are  obviously  different  - 
that  is,  the  separation  between  the  two  curves  is  large  in  comparison  to  the 
range  of  the  replicate  analyses  of  RDC  26  oil  in  Figure  18.  Furthermore, 
the  RDC  25  curve  in  Figure  23  has  its  fifty  percentile  point  at  252  amu 
whereas  number  24  oil  crosses  the  fifty  percentile  point  at  407  amu, 
implying  that  the  former  is  a relatively  "light  weight"  refined  oil 
while  the  latter  is  a crude  or  residual  product. 

An 8-dimensional vector representation was used to characterize each of
the 42 cumulative distributions. Similarity scores were computed for
comparisons of a reference vector to each of the 41 remaining vectors,
using threshold factors of 2%, 4%, ..., 20%. These computations were
performed for each of the 42 vectors as reference. The first of these 420
score tables is reproduced in Table 9, where the representations of 41 spectra
are compared to the spectrum with tag number 4 (RDC 5), using a 2% threshold
factor to define the match or mismatch of a vector component with the
corresponding component of spectrum number 4. The first entry in the
"score = 0" column is tag number 6 (RDC 19), indicating that none of the 8
vector components representing the cumulative distribution of spectrum number
6 fell within ±2% of their corresponding components in the representation of
spectrum number 4. Since RDC 5 and RDC 19 are the code names for oils from
different spill cases (see Table 8), it is reasonable to assume that these
are different oils so that the lack of similarity is not surprising.

Reference to Table 8 shows that RDC 5 was mass analyzed in duplicate (viz.,
tag numbers 4 and 12). Table 9 shows that tag number 12 got a similarity
score of 1, while tag numbers 15 (RDC 2) and 19 (RDC 4) got scores of
3 each. Tag numbers 4, 12, 15 and 19 all correspond to case number 43.

Table 9

Similarity Data for 43 Sector Magnet Oil
Spectra Comprising 28 Specimens

Reference tag: 4 (RDC 5). Threshold: 2%. Score columns: 8, 7, 6, 5, 4, 3, 2, 1, 0.

[Table body, listing tag numbers under each score, is not legible in the source.]

The  cumulative  distributions  shown  in  Figure  19  indicate  that  the  spectra 
in  this  case  were  indistinguishable  within  the  present  constraint  of  a 
16%  standard  deviation. 

The 420 similarity score tables were reduced to "case" tables. For
example, Table 9 reduces to Table 10, where the comparisons to spectrum
number 4 are limited to the other spectra in case 43. In general, the
similarity score for comparing spectrum X to reference Y is different
from the score obtained when spectrum Y is compared to reference X.

The two scores are not independent, however. The analysis of 12 replicate
spectra of RDC 26 oil (using a threshold factor of 8%) resulted in 12 x 11
= 132 similarity scores, half of which correspond to "X compared to
reference Y" while the other half correspond to "Y compared to reference
X". The correlation coefficient obtained by analyzing the 66 pairs of
scores was 0.88. It was decided, therefore, to use the average of each pair.
For example, tag 12 got a similarity score of 1 when compared to tag 4 as
reference, and tag 4 got a similarity score of 3 when compared to tag 12 as
reference. Therefore, a score of (3 + 1)/2 = 2 was used for the similarity
in a comparison of the spectra with tag numbers 4 and 12, using a 2%
threshold factor.
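The scoring and pair averaging described above can be sketched as follows. This is only an illustration: the function names are hypothetical, and the matching rule (a component matches when it lies within the threshold fraction of the reference component) is an assumption inferred from the "within ±2% of their corresponding components" wording, not a transcription of the original program.

```python
def similarity_score(spectrum, reference, threshold):
    """Count the vector components of `spectrum` that lie within
    +/- threshold (as a fraction) of the matching reference component.
    Assumption: the tolerance is taken relative to the reference value."""
    return sum(
        1
        for s, r in zip(spectrum, reference)
        if abs(s - r) <= threshold * abs(r)
    )


def averaged_score(a, b, threshold):
    """Average of the two directed scores (a vs. reference b, b vs. reference a)."""
    return (similarity_score(a, b, threshold) + similarity_score(b, a, threshold)) / 2
```

Because the tolerance scales with the reference component, the two directed scores can differ, which is why the pair is averaged. With the scores quoted in the text for tags 4 and 12 (1 and 3), the averaged score is 2.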
Table 10

Case 43: Similarity Scores for 6 Oil Spectra
Compared to Spectrum Number 4 (RDC 5)

Entries = Tag Nos.


Table 11 shows similarity scores for the 7 spectra in case 43 using
threshold factors of 8% and 4%. The 8% threshold data show that when the
duplicate analyses of RDC 3 oil were compared, a similarity score of 8
(i.e., maximum similarity) was obtained. The duplicate analyses of RDC 5
also resulted in a high score of 7. On the other hand, when one of these
spectral representations, RDC 5(2), was compared to the representation
of any other oil in case 43, a score of 8 was obtained. When the threshold
factor was reduced to 4%, so that the requirements for a spectral match were
more stringent, Table 11 shows that the duplicate spectra of RDC 5 were less
similar than RDC 5(2) was to any of the other spectra in case 43. On the
basis of these results, we cannot distinguish between any of the specimens
in case 43. This result is consistent with the indistinguishability of the
cumulative distributions shown in Figure 19. The use of other threshold
factors resulted in the same conclusion.

Table 11

Case 43: Similarity Scores for Comparisons between
7 Spectra Representing 5 Oil Specimens

Threshold = 8%

RDC #     2     3(1)   3(2)    4     5(1)   5(2)
1        8.0    8.0    7.5    8.0    7.0    8.0
2               8.0    7.0    7.5    6.0    8.0
3(1)                   8.0    8.0    6.5    8.0
3(2)                          6.5    5.5    8.0
4                                    6.0    8.0
5(1)                                        7.0

Threshold = 4%

RDC #     2     3(1)   3(2)    4     5(1)   5(2)
1        4.0    8.0    5.0    6.0    5.0    7.0
2               7.5    5.0    6.0    5.0    5.0
3(1)                   7.0    5.0    3.0    8.0
3(2)                          3.0    3.0    7.0
4                                    4.0    5.0
5(1)                                        4.0

Table 12

Case 82: Similarity data for 12 replicate analyses of RDC 26 oil and
for single analyses of oils with code numbers 27, 28 and 29.
Dimensionality = 8, threshold = 8%.

RDC 26; Grand average = 6.00 (n

In an attempt to determine the significance of differences in similarity
scores, 12 replicate analyses of RDC 26 were analyzed. Table 12 shows the
similarity scores for comparisons between twelve replicates of RDC 26 and one
analysis each for RDC numbers 27, 28 and 29. The $\bar{X}$ row shows the average of
the scores resulting from comparisons with the replicates of RDC 26. The
entries under the diagonal line were used merely to facilitate the computation
of the average scores; otherwise they are redundant. According to these
results, if we assume that the individual scores for comparisons of replicate
spectra are distributed normally with mean = 6.00 and variance 3.81, approximately
70% of the similarity scores will fall in the range $4 \le X \le 8$ when two spectra
represent the same oil. If we use the average of eleven scores, 90% of the
average scores will fall in the range $4 \le \bar{X} \le 8$. As a result of these
estimates, the following conclusions concerning case 82 were drawn:

(1) RDC 27 is different from the other oils in case 82.

(2) RDC 28 can be identified with RDC 26, as the score $\bar{X} = 6.37$,
computed from comparisons with all 12 spectra of RDC 26, agrees
well with the grand average score of 6.00 for RDC 26 spectra
compared to each other.

(3) RDC 29 does not match well with RDC 26. As a null hypothesis,
assume that the two specimens are from the same source. Using
the Student-t statistic

\[ t = \frac{\bar{X}_1 - \bar{X}_2}{S_p \sqrt{1/n_1 + 1/n_2}}, \]

with $\bar{X}_1 = 6.00$, $n_1 = 66$, $\bar{X}_2 = 3.75$, $n_2 = 12$ and
$S_p^2 = 3.86$, one obtains t = 3.65 with $n_1 + n_2 - 2 = 76$ degrees of
freedom. The probability, P[t > 3.65], is $5 \times 10^{-4}$; therefore the
null hypothesis was rejected.

By inference, this result also implies that RDC 29 and RDC 28 are
different oils. Furthermore, the direct comparison between these
two oils resulted in a low similarity score of 2 (see Table 12),
which is outside the range of the 66 RDC 26 intragroup scores.
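The figures quoted for case 82 can be checked with a short calculation. The sketch below assumes a pooled-variance two-sample t statistic with a plus sign under the square root, and takes the second sample size as 12 (inferred from the 76 degrees of freedom and the 12 RDC 26 spectra); the function names are illustrative only.

```python
import math


def pooled_t(mean1, n1, mean2, n2, pooled_variance):
    """Two-sample t statistic with a pooled variance estimate.
    Returns (t, degrees_of_freedom)."""
    standard_error = math.sqrt(pooled_variance * (1.0 / n1 + 1.0 / n2))
    return (mean1 - mean2) / standard_error, n1 + n2 - 2


def normal_coverage(mean, variance, low, high):
    """Probability that a normal variate falls in [low, high]."""
    sd = math.sqrt(variance)

    def cdf(x):
        return 0.5 * (1.0 + math.erf((x - mean) / (sd * math.sqrt(2.0))))

    return cdf(high) - cdf(low)


# Values quoted in the text; n2 = 12 is an inferred assumption.
t, dof = pooled_t(6.00, 66, 3.75, 12, 3.86)
coverage = normal_coverage(6.00, 3.81, 4.0, 8.0)
```

Under these assumptions `t` comes out close to the reported 3.65 with 76 degrees of freedom, and `coverage` is close to the 70% quoted for single scores in the range 4 to 8.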

The  results  of  comparing  the  six  specimens  of  case  66  are  shown  in 
Table  13.  The  curves  in  Figure  20  imply  the  existence  of  three  different 
oils  represented  by  the  specimens  with  RDC  numbers  (6,7,8),  (9,10)  and  11. 
The  scores  in  Table  13  reflect  the  apparent  clustering  in  Figure  20  with 
high  scores  for  comparisons  between  the  spectra  of  oils  within  each  cluster 
and  low  scores  for  comparisons  between  oils  in  different  clusters. 
Table 13

Case 66: Similarity Scores for Mass Analyses
of 6 Oil Specimens

Dimensionality = 8, threshold = 8%

RDC #     7      8      9     10     11
6        8.0    7.0    4.5    4.5    2.0
7               7.0    4.0    4.0    4.0
8                      5.0    5.0    3.0
9                             8.0    4.0
10                                   4.5

The scores for the five oils in case 68 are shown in Table 14 where,
in spite of the fact that the RDC 12 curve in Figure 21 appears to be
well isolated from the other curves, the comparison between RDC 12 and
RDC 14 resulted in a score of 5. This result implies that the RDC 12 and
RDC 14 curves are parallel to within ±8% over one-half of the total mass
range (i.e., the 8-dimensional vectors were constructed out of eight of
the ten percentile intervals for these curves).


Table 14

Case 68: Similarity Scores for Spectra
Representing 5 Oil Specimens

Dimensionality = 8, threshold = 8%

RDC #    13     14     15     16
12       3.0    5.0    3.5    2.5
13              6.0    8.0    5.0
14                     7.0    5.0
15                            6.0

The scores for case 77 are shown in Table 15.

Table 15

Case 77: Similarity Scores for Spectra
Representing 6 Oil Specimens

Dimensionality = 8, threshold = 8%

         19     20     21     22     23
18        0      0      0      0      0
19              5.0     0      0      0
20                      0      0     2.0
21                             0      0
22                                   7.0
The similarity score for the two spectra in case 80 was 1, indicating
that in spite of the difference between the two curves in Figure 23, these
cumulative distributions are parallel to within ±8% over one of the ten
percentile mass intervals. Judging from the appearance of Figure 23, this
interval is in the center of the mass range.
V CONCLUSIONS


Identification criteria can be evaluated in terms of the frequency
with which either of two kinds of error is made: (a) the common
identity of a pair of spectra is incorrectly rejected and (b) the
common identity of a pair of spectra is incorrectly accepted. The
significance of the total difference between two spectra is dependent on
(a) the number of possible differences (i.e., the number of peaks in the
spectra or the dimensionality of their representations), (b) the definition
of a spectral difference (i.e., the threshold values), (c) the number of
spectral differences, (d) how well the true spectra are known and (e) the
dispersion of the data as measured by the variances, the ranges, etc.

The relationship between these five parameters and the two kinds of error
that determine the success or failure of an identification criterion can
be described precisely by assuming that errors in measurements are random
and independent and applying a statistical model. Notwithstanding the fact
that these assumptions were invalidated experimentally, the conclusions of
the statistical model provided guidelines for constructing an empirical
discriminant function. The application of this discriminant function to a
set of 8-dimensional vectors that described only the gross characteristics
of oil spectra resulted in the classification of 154 spectra into 35 classes
with an error rate of 5%. The success of the reduced dimensionality
representation is probably due to the fact that the envelope of a field
ionization mass spectrogram represents the molecular mass distribution in a
complex mixture.
The "degree of similarity" is a precise and conceptually simple way to
rank spectra according to their similarity to a reference spectrum. This
technique can be applied to entire spectra or to their reduced dimensionality
representations.
VI APPENDIX


Description of Oil Identification Computer Programs


Mass spectra in the form of sequences of integers

\[ S = \{\, y(n) : n = 1, 2, \ldots, 4096 \,\} \tag{A1} \]

are recorded on 9-track magnetic tape in the laboratory. For a mass
range sweep period of t seconds, the nth integer, y(n), is equal to the
number of ions accumulated in a "window" or "channel" of width t/4096
seconds. The tape is read by a computer (Burroughs B6700) and the
data are stored on magnetic disk in a file called RAWDATA. The 9-track
tape is retained as an archive.


Individual mass peaks usually span 12 to 28 channels. Electromagnetic
interference from sources external to the spectrometer can result in
extraordinarily large numbers at random positions in the spectrum. This
noise appears as a set of "spikes" in a spectrogram whose signal to
noise ratio is uniform elsewhere. The RAWDATA file is read by a program
called "SMOOTH" which removes isolatable noise and smooths the data by
a least squares fit technique (reference 16).


To detect and correct isolatable noise at $n = n_i$, the following
algorithm is used:

\[ \text{IF} \quad y(n_i) > B + T\,[\,y(n_i - 1) + y(n_i + 1)\,]/2, \tag{A2} \]

\[ \text{THEN} \quad y(n_i) = [\,y(n_i - 1) + y(n_i + 1)\,]/2, \tag{A3} \]

where B is a base, below which no corrections are to be made, and T is
a threshold value. For example, for T = 3, y(2000) is defined as a spike
if its value exceeds the base B by more than three times the average
of its immediate neighbors y(1999) and y(2001). Correction consists
of replacing a spike with the average value of its neighbors, as
indicated in equation A3.
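The spike test and correction of equations A2 and A3 can be sketched in a few lines. The function name and the choice to test every channel against the uncorrected neighbors are assumptions made for illustration; the original program is not reproduced here.

```python
def remove_spikes(y, base, threshold):
    """Replace isolated spikes by the mean of their two neighbors.

    A channel n is treated as a spike when
        y[n] > base + threshold * (y[n-1] + y[n+1]) / 2      (A2)
    and is then set to (y[n-1] + y[n+1]) / 2                  (A3).
    Assumption: neighbors are read from the original, uncorrected data.
    """
    corrected = list(y)
    for n in range(1, len(y) - 1):
        neighbor_mean = (y[n - 1] + y[n + 1]) / 2
        if y[n] > base + threshold * neighbor_mean:
            corrected[n] = neighbor_mean
    return corrected
```

For example, `remove_spikes([10, 12, 500, 11, 9], base=20, threshold=3)` replaces only the value 500, because its neighbors average 11.5 and 500 exceeds 20 + 3 x 11.5.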

For data smoothing, we assume that the individual peaks can be
described adequately by a polynomial in n of degree 5. The value of $y(n_i)$
in the smoothed spectrum is computed by applying a least squares
fit to data points (n, y) over the range $n_i - \Delta \le n \le n_i + \Delta$,
where $1 + 2\Delta$ is the number (i.e., the nearest odd integer) of channels
per amu. The smoothed spectrum is stored in a file called SMOOTHDATA.
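A moving least squares polynomial fit of this kind can be sketched as below. Because the fitted polynomial is only evaluated at the window center, the fit reduces to a fixed set of weights applied to each window. The function names, the scaling of channel offsets to the interval [-1, 1], and the decision to leave the first and last channels unsmoothed are assumptions of this sketch, not details taken from the original program.

```python
def solve_linear(a, b):
    """Solve a.x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x


def center_weights(half_width, degree=5):
    """Weights w_k such that sum(w_k * y_k) is the value, at the window
    center, of the least squares polynomial fitted to the window."""
    xs = [k / half_width for k in range(-half_width, half_width + 1)]
    normal = [[sum(x ** (p + q) for x in xs) for q in range(degree + 1)]
              for p in range(degree + 1)]
    z = solve_linear(normal, [1.0] + [0.0] * degree)
    return [sum(z[p] * x ** p for p in range(degree + 1)) for x in xs]


def smooth(y, half_width=10, degree=5):
    """Smooth y with a moving polynomial fit over 1 + 2*half_width channels."""
    w = center_weights(half_width, degree)
    out = [float(v) for v in y]
    for n in range(half_width, len(y) - half_width):
        out[n] = sum(wk * y[n - half_width + k] for k, wk in enumerate(w))
    return out
```

With about 20 to 21 channels per amu, `half_width=10` gives the 21-channel window implied by the text. Data that already follow a polynomial of degree 5 or lower pass through this filter unchanged, which is a convenient check.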

Spectral peaks are detected, peak areas are computed and mass
numbers are assigned to the spectral position variable n by the program
"PKFIND".

A mass peak is detected in a smoothed spectrum at $n = n_i$ according
to the following criteria:

\[ \text{IF} \quad y(n_i) > 2 \times \text{noise} \tag{A4} \]

\[ \text{and IF} \quad y(n_i - 1) \le y(n_i) \ge y(n_i + 1) \tag{A5} \]

\[ \text{and IF} \quad y(n_i - w) < R\,y(n_i) > y(n_i + w), \quad 0 < R < 1, \tag{A6} \]

then $n_i$ is the position of a peak maximum. "Noise" is defined as either
the minimum of the spectrum or some arbitrarily small positive integer
(threshold value). Equation A4 eliminates baseline fluctuations as
candidates for mass peaks. Equation A5 is a local maximum test to
determine whether or not $y(n_i)$ is at least as large as its two immediate
neighbors. Equation A6 is a peak width test. For example, if 4w is the
number of channels per amu, 2w is an approximate upper limit for the full
width at half maximum of peaks in a well resolved spectrum. Thus, for
$R \le 0.5$, equation A6 discriminates against fused peaks and unsmoothed
"bumps" in the spectrum. After scanning the entire spectrum for peaks and
storing the value of n corresponding to the largest peak, the program
enters a mass calibration routine.
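The three tests A4 to A6 translate directly into a scan over the smoothed spectrum. The sketch below is illustrative only; the function name and the default R = 0.5 are assumptions.

```python
def find_peaks(y, noise, w, r=0.5):
    """Return the channel indices that pass the three peak tests."""
    peaks = []
    for n in range(w, len(y) - w):
        above_noise = y[n] > 2 * noise                    # A4: reject baseline
        local_max = y[n - 1] <= y[n] >= y[n + 1]          # A5: local maximum
        narrow = y[n - w] < r * y[n] > y[n + w]           # A6: peak width
        if above_noise and local_max and narrow:
            peaks.append(n)
    return peaks
```

A sharp peak such as `[1, 1, 2, 5, 10, 5, 2, 1, 1]` with `w=2` is accepted at channel 4, while a broad bump such as `[1, 1, 8, 9, 10, 9, 8, 1, 1]` is rejected because the values two channels from the maximum have not fallen below half of it. Note that, as written, a flat-topped peak would pass test A5 at more than one channel.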

The numbers of channels between the jth peak and the (j-1)th peak and
between the jth peak and the (j-2)th peak are computed for j = 3, 4, ...,
NPEAK, where NPEAK is the total number of peaks detected. By visually
inspecting a few spectra in the data set, one can compute a range for the
number of channels between adjacent peaks (e.g., 4096 channels spanning
200 amu = 20 to 21 channels per mass unit). The numbers of channels
between the jth peak and its neighbors are compared to a range that has
been selected according to two criteria.

1. The range must be small enough to discriminate between (a) two
peaks that are separated by approximately one mass unit, and (b) peaks
that are separated by two or more mass units or less than one mass unit
(i.e., "bumps" that had enough modulation to evade the screening of
equation A6).

2. The range must be large enough to accommodate the variations in
the number of channels per amu that are due to systematic non-linearities,
signal noise and the limit in resolution imposed by the finite width of
a channel.

When the "distance" between two peaks falls in this range, the
reciprocal distance (units = amu per channel) and the value of n
corresponding to the jth peak are stored. In other words, this algorithm
stores data that describe the slope of a mass number versus channel
number calibration curve, as a function of channel number. These data
are used to compute a regression curve for the slope of the calibration
curve as a quadratic function of n.
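The collection of slope data can be sketched as follows. The function name is hypothetical, and storing only the first acceptable spacing for each peak is an assumption; the quadratic regression step itself is not shown.

```python
def calibration_slopes(peak_channels, min_gap, max_gap):
    """Collect (channel, amu-per-channel) points for the calibration slope.

    For each peak from the third onward, the spacing to the previous peak
    and to the peak before that is tested; a spacing inside
    [min_gap, max_gap] is taken to represent one mass unit.
    """
    points = []
    for j in range(2, len(peak_channels)):
        for back in (1, 2):
            gap = peak_channels[j] - peak_channels[j - back]
            if min_gap <= gap <= max_gap:
                points.append((peak_channels[j], 1.0 / gap))
                break
    return points
```

With peaks at channels `[100, 120, 141, 150, 161, 203]` and an accepted range of 19 to 22 channels, the peak at 150 contributes nothing (it is too close to 141 and too far from 120), while the peak at 161 is paired with the peak at 141, skipping the bump at 150.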
At this point in the program, the computer displays the channel
number at which the largest, or (jmax)th, peak occurred and requests a mass
number assignment. By inspecting a spectrogram of the data being
analyzed and referring to spectrometer calibration data, the user can
supply the computer with the proper mass number. Next, the number of
channels between the largest peak and the (jmax-1)th peak is computed and
multiplied by the slope (evaluated at the channel number of the largest
peak) to determine the number of mass units that separate the two peaks.
The latter number is subtracted from the mass number of the largest peak
to obtain a mass assignment for the (jmax-1)th peak. The slope is now
re-evaluated at the channel number of the (jmax-1)th peak and the number
of mass units to the next peak on the list is computed. The latter
number is subtracted from the mass number of the (jmax-1)th peak to
obtain the mass number for the (jmax-2)th peak. This procedure is
iterated until the mass number of the first peak on the list is computed.
The analysis now returns to the (jmax)th peak and proceeds iteratively
toward the end of the list until the last peak is assigned a mass number.
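The outward walk from the user-assigned peak can be sketched as below. The function name and the representation of the calibration slope as a callable `slope(channel)` are assumptions; no rounding to integer mass numbers is applied because the text does not state where rounding occurs.

```python
def assign_masses(peak_channels, anchor_index, anchor_mass, slope):
    """Assign mass numbers to all peaks, starting from one known peak.

    `slope(channel)` returns amu per channel at that channel, e.g. the
    quadratic regression curve described above.
    """
    masses = [0.0] * len(peak_channels)
    masses[anchor_index] = float(anchor_mass)
    # Walk toward the start of the list, re-evaluating the slope at each peak.
    for j in range(anchor_index, 0, -1):
        gap = peak_channels[j] - peak_channels[j - 1]
        masses[j - 1] = masses[j] - gap * slope(peak_channels[j])
    # Return to the anchor and walk toward the end of the list.
    for j in range(anchor_index, len(peak_channels) - 1):
        gap = peak_channels[j + 1] - peak_channels[j]
        masses[j + 1] = masses[j] + gap * slope(peak_channels[j])
    return masses
```

For instance, with peaks at channels 100, 120, 140 and 180, a constant slope of 0.05 amu per channel, and the third peak assigned mass 252, the peaks receive masses 250, 251, 252 and 254.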

Peak areas are computed by summing the numbers that lie in the
channel range whose boundaries are the minima that lie to the right and
to the left of the peak maximum. To avoid integrating the areas of
individual unsmoothed ripples, the algorithm for finding the boundary
minima is limited to small ranges that are centered at 1/2 mass unit
from the peak maximum.

There are two options for subtracting noise: "noise", as defined
above, can be multiplied by the number of channels that lie between the
two peak minima and subtracted from the peak area, or the computer can
perform the equivalent of drawing a line tangent to the two minima and
subtracting that portion of the peak area that lies below the line.
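Both noise subtraction options can be illustrated with a small function. The name `peak_area`, the inclusive channel count, and the use of a straight line through the two boundary minima as the "tangent" baseline are assumptions of this sketch.

```python
def peak_area(y, left_min, right_min, noise=None):
    """Area of a peak bounded by the minima at channels left_min and right_min.

    If `noise` is given, a constant noise level is subtracted from every
    channel in the range. Otherwise the area under the straight line joining
    the two boundary minima is subtracted.
    """
    channels = range(left_min, right_min + 1)
    raw = sum(y[n] for n in channels)
    if noise is not None:
        return raw - noise * len(channels)
    span = right_min - left_min
    rise = y[right_min] - y[left_min]
    baseline = sum(y[left_min] + rise * (n - left_min) / span for n in channels)
    return raw - baseline
```

For the channel values `[2, 4, 10, 4, 2]`, the constant option with a noise level of 1 gives 22 - 5 = 17, and the line option gives 22 - 10 = 12.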

"PKFIND"  writes  a file,  called  PEAKS,  that  consists  of  NPEAK  + 1 records 
for each spectrum. Each record consists of two entries. The format is
as follows:

(TAG, NPEAK), $(A_1, m_1)$, $(A_2, m_2)$, ..., $(A_{\text{NPEAK}}, m_{\text{NPEAK}})$.

TAG is the identification number for the spectrum and NPEAK is the number
of peaks in the list. $A_j$ is the area of the jth peak whose mass number is
$m_j$.

Notwithstanding the curve smoothing effect of the least squares fit
algorithm and the modulation requirement in the peak detection algorithm,
the file PEAKS contains some ambiguous entries (viz., more than one peak
area with the same mass assignment). A program called "MASSTRAP" detects
ambiguous entries in a spectrum and deletes all peak areas with the same
mass number except the largest one. The assumption implied by using this
program is that the smaller peaks correspond to bumps on the "real" mass
peak. Examination of PEAKS shows that in cases where ambiguity exists
there is usually a dominant entry. "MASSTRAP" writes a cleaned-up version
of PEAKS called XPEAKS.
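The rule applied to ambiguous entries amounts to keeping the largest area for each mass number. A minimal sketch, using (area, mass) pairs in the record order described for PEAKS; the function name and the sorted output order are assumptions:

```python
def drop_ambiguous(peaks):
    """Keep only the largest peak area for each mass number.

    `peaks` is a list of (area, mass) pairs; the result is sorted by mass.
    """
    best = {}
    for area, mass in peaks:
        if mass not in best or area > best[mass]:
            best[mass] = area
    return [(best[mass], mass) for mass in sorted(best)]
```

For example, if mass 251 appears twice with areas 40.0 and 3.0, only the entry with area 40.0 survives.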

The degradation in resolution with increasing mass number that occurred
with the Colutron and quadrupole mass analyzers presented several data
analysis problems. Without the peak width test of equation A6 there is no
way to distinguish between fused and unfused peaks. For example, the fusion
of peaks may exist in the range m > 280 in one spectrum of a given oil
while in another spectrum of the same oil it may extend over the range
m > 282. In a comparison of the two spectra the computer would compare
a large peak at m = 280 in the first spectrum to a small peak in the second
one and a peak with zero area at m = 281 in the first spectrum to a peak
with finite area in the second one. Furthermore, a fused peak has a
maximum that lies between the two maxima of its constituents so that the
systematic degradation of resolution with increasing mass number results
in an apparent nonlinearity of the mass scale.

Oil identification by mass spectrometry is a pattern recognition
problem. The assignment of mass numbers to peaks is merely a means of
labeling them to facilitate the comparison of spectra. Therefore, an
apparent nonlinearity in the mass scale is only an inconvenience.
Inconsistency in the resolution of replicate spectra, however, does present
a more serious problem. In view of the large numbers of peaks available
per spectrum, it was decided to use a peak width test and to limit the
characterization of an oil to the mass numbers at which peaks were well
resolved in all of the replicate spectra simultaneously.

The  computer  program  "COMMASS"  reads  the  replicate  spectra  of  a 
given  oil  from  the  file  XPEAKS,  selects  the  mass  numbers  at  which  all  of 
the  spectra  have  peaks  and  writes  a file  for  that  class  in  which  each 
spectrum  contains  peaks  at  the  same  mass  numbers. 
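The selection of common mass numbers is a set intersection across the replicate spectra. In the sketch below each spectrum is represented as a dictionary mapping mass number to peak area; that representation and the function name are assumptions made for illustration.

```python
def common_mass_spectra(spectra):
    """Restrict every replicate spectrum to the mass numbers that are
    present in all of them.

    `spectra` is a list of dicts {mass_number: peak_area}.
    """
    common = set(spectra[0])
    for spectrum in spectra[1:]:
        common &= set(spectrum)
    masses = sorted(common)
    return [{m: spectrum[m] for m in masses} for spectrum in spectra]
```

Two replicates with peaks at masses (250, 251, 252) and (251, 252, 253) are both reduced to the peaks at 251 and 252.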

A program called "STAT" reads class files (i.e., sets of spectra of
a single oil) and performs the following computations and analyses:

(1) Each spectrum in the file is normalized to a total peak
area of unity.

(2) A normalized class average spectrum is computed.

(3) The relative error is computed for each mass number of each
spectrum,

\[ e_{ij} = (x_{ij} - \bar{x}_j)/\bar{x}_j, \quad i = 1, 2, \ldots, \text{NSPEC}, \quad j = 1, 2, \ldots, \text{NPEAK}, \tag{A7} \]

where $e_{ij}$ is the relative error of the ith spectrum evaluated at the jth
mass number, $x_{ij}$ is the normalized peak area of the ith spectrum at the
jth mass number and $\bar{x}_j$ is the peak area at the jth mass number, averaged
over all the spectra in the class.

(4) For each spectrum the computer plots the relative error versus
the peak number j.

(5) Graphs generally show errors that monotonically increase or
decrease with mass number at rates that are approximately constant.
Therefore, regression lines are computed for the relative errors
of each spectrum as linear functions of mass number. As an
approximation, we assume

\[ e_{ij} = A_i + B_i m_j + r_{ij}, \tag{A8} \]

where $A_i$ and $B_i$ are constants whose values are computed in the regression
analysis and $r_{ij}$ is a random error. Equation A8 represents an attempt to
resolve the relative error into systematic and random components.

Combining equations A7 and A8 and solving for $x_{ij}$, one obtains

\[ x_{ij} = \bar{x}_j + \bar{x}_j (A_i + B_i m_j + r_{ij}). \tag{A9} \]

The transformation

\[ x'_{ij} = x_{ij} - \bar{x}_j (A_i + B_i m_j) \tag{A10} \]

yields

\[ x'_{ij} = \bar{x}_j + \bar{x}_j r_{ij}. \tag{A11} \]

Therefore, to the extent that equation A8 is a valid approximation,
the transformed spectrum whose jth normalized peak area is
described by equation A11 differs from the class average
spectrum by random errors only.

It should be noted that the normalization requirement for the
$x'_{ij}$ imposes an additional equation in the regression analysis.
(6) The normalized spectra are transformed according to equation A10.

(7) A covariance matrix is computed for the variations of peak areas
about their respective mean values.

(8) The hypothesis that the variations of peak areas about their
respective means are distributed normally is tested.

(9) A correlation coefficient matrix is computed.

(10) The hypothesis that the variations in peak area are stochastically
independent is tested.

(11) The computer prints the results of the two tests, the normalized
peak areas of the class average spectrum, the standard deviations
for peak areas at each mass number in the spectrum and the standard
deviation averaged over all peaks in the spectrum.

In general, after the correction for systematic error, which accounts
for approximately 10% of the average variance, the data appear to be normally
distributed but fail to pass the stochastic independence test.
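Steps (1) through (6) of the analysis can be sketched as follows. This is a simplified illustration under stated assumptions: the regression is an ordinary unweighted least squares line, the extra constraint from the normalization requirement is ignored, and all function names are hypothetical.

```python
def normalize(areas):
    """Scale a spectrum to a total peak area of unity (step 1)."""
    total = sum(areas)
    return [a / total for a in areas]


def linear_fit(xs, ys):
    """Ordinary least squares line ys = a + b * xs; returns (a, b)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def remove_systematic_error(spectra, masses):
    """Return (class_average, corrected_spectra) for one class of spectra.

    Each spectrum is a list of peak areas at the common mass numbers.
    """
    normalized = [normalize(s) for s in spectra]
    count = len(normalized)
    average = [sum(s[j] for s in normalized) / count      # step 2
               for j in range(len(masses))]
    corrected = []
    for s in normalized:
        errors = [(x - xb) / xb for x, xb in zip(s, average)]        # A7
        a, b = linear_fit(masses, errors)                            # A8
        corrected.append([x - xb * (a + b * m)                       # A10
                          for x, xb, m in zip(s, average, masses)])
    return average, corrected
```

When the relative errors of a spectrum are exactly linear in mass number, the corrected spectrum coincides with the class average, as equation A11 predicts for vanishing random error.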